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INTRODUCTION 

One  often  seeks  to  determine  the  poles  of  a  system  by  observing  its 
natural  response  after  cessation  of  input.  Several  examples  can  be  cited: 

(i)  Acoustic  transducers  [1,2]  and  electromagnetic  antennae  [3]  are  lately 
being  tested  by  applying  a  pulse  input,  (ii)  Aircraft  and  other  military 
hardware  are  being  subjected  to  electromagnetic  pulse  (EMP)  tests  [4],  (iii) 
Engines  and  other  cast  metal  objects  are  often  tested  by  direct  mechan¬ 
ical  impact  [5],  (iv)  Rooms  are  excited  acoustically  and  decay  characteristics 
are  recorded  [6],  In  these  and  other  cases  the  response  after  cessation  of 
input  stimulous  may  be  expressed  as 

yt  =  £  <*k  exp(skt)  (1) 


The  <*k's  depend  upon  the  excitation,  location  of  the  sensor  that  extracts 
y^,  and  selection  of  time  origin,  but  the  s-poles,  s^,  are  inherent  char¬ 
acteristics  of  the  system  and  remain  invariant  so  long  as  the  parameters 
of  the  applicable  wave  equation  and  its  post-excitation  boundary  conditions 

are  not  disturbed.  Sampling  at  interval  T,  starting  at  t  ,  gives  a 

o 

sequence  {y(n)}: 


y(n)  *  I  r.z  u^n)  *ot  ~“<n<+fl0 
k  14  * 


(2) 


where  z.*  exp(s  T),  r  *  o  exp(s  t  )  and  the  unit  step  sequence,  u(n),  has 

K  JC  JC  JC  K  O 


£ 


been  used  to  extend  {y(n)>  with  zeroes  for  negative  n.  Taking  Z-transforms 
gives 

Y(z)/z  =  I  rk/(z  -  zk>  (3) 

k 

so  that  {z  }  and  {r  }  are  the  poles  and  residues  of  Y(z)/z.  This  paper 

iC  K 

concerns  the  problem  of  identifying  the  "system  poles"  {z^}  through  observa¬ 
tion  of  {y(n)>  sequences.  Then  the  s-poles  can  be  determined  from 

sfc  =  T-1log  zk  =  (log|zk|  +  j[arg(zk)  +  2n£])/T  (4) 

The  ambiguity  due  to  aliasing  is  expressed  by  £.  If  ir/T  is  known  to  be 
greater  than  the  imaginary  part  of  every  s-pole  then  &  =  0.  Even  when  this 
is  not  true,  if  one  can  repeat  the  identification  process  with  a  slightly 
different  sampling  interval,  say  T  +  dT,  giving  displaced  system  poles 
zk  +  dzfc,  then  since  dzk  =  skzkdT  it  follows  that  sk  =  dzk/(zkdT).  This 
computation  only  has  to  be  accurate  enough  to  resolve  the  uncertainty  in  £. 
Some  of  the  pole  identification  methods  to  be  discussed  here  use  decimated 
subsequences  and  determine  only  the  q^-power  of  each  system  pole.  This 
contributes  an  additional  "decimation  aliasing"  expressed  by  i  in  the 
formula 

l/q  i/q 

Zk  “  ^Zk^  *  |Zk|  exp{j (arg[z^]  +  2ir£)/q}  (5) 

q+1 

But  this  ambiguity  can  be  resolved  in  a  similar  manner.  If  zk  can  also 
be  determined  then  z  *  z^+1/z^  ,  which  should  be  accurate  enough  to 
resolve  arg(zk>. 

In  all  that  follows  we  shall  assume  that  the  system  poles  are  non-zero, 
distinct,  and  K  in  number.  Throughout  this  paper  we  shall  use  the  nota¬ 
tion  K'  *  K  +  1.  Equation  (3)  can  then  be  expressed  with  a  ratio  of 


polynomials , 


3 


Y(z)/z 


*  • 


V'  +  Vl* 


K-l 


..  +  e  z  +  6 
_ 1 _ 0 

+  ...  +  e.z  +  8^ 
1  o 


(6) 


where  the  system  poles  are  the  roots  of  the  denominator  polynomial.  Our 

failure  to  normalize  either  polynomial  enables  us  to  scale  the  coefficients 

of  the  denominator  freely,  with  the  numerator  coefficients  then  being 

uniquely  determined.  Due  to  the  assumptions  above,  9  and  9  and  at  least 

one  of  the  3's  must  be  non-zero,  and  the  polynomials  can  have  no  roots  in 

A  T 

common.  We  shall  define  the  K^xl  vector,  9=(9  ,9  , . . . ,9  )  .  Furthermore 

U  1  K 

0  12  T 

the  symbol  2  will  be  used  to  denote  the  "power  vector"  (z  ,z  ,z  ,...)  where 

z  is  a  generic  complex  variable.  The  number  of  elements  in  the  vector  must 

be  inferred  from  the  context  of  its  use.  If  z  is  explicitly  z^,  the  ith 

system  pole,  then  its  power  vector  will  be  denoted  Z^.  (Other  than  this  Z 

function,  all  vectors  and  matrices  in  this  paper  will  be  real.  Note  also 

that  the  first  element  of  a  vector  will  always  be  denoted  by  a  "0"  sub- 

.  T 

script.)  Using  this  notation  the  pole  polynomial  is  simply  9  and  the 

T 

system  poles  satisfy  9  Z^  =  0.  Clearing  the  denominator  and  taking  inverse 
transforms  in  Eq.  (6)  gives 

9  y(n)  +  9  y (n+1)  +  ...  +  9  y(n+K)  =  B  6(n+l)  +  —  +  B  6 (n+K)  (7) 

Q1  K  0  1 

in  particular,  for  n£  0, 

(y(n),  y(n+l),  ...  ,  y(n+K))  •  9  E  0  (8) 

Since  in  Eq.  (8)  one  can  solve  for  y(n+K)  as  a  linear  combination  of  the 
preceeding  y(i)'s,  the  set  {y(n)}  forms  an  autoregressive  (AR)  sequence. 
Another  interpretation  is  that  when  the  sequence  is  passed  through  a  finite 
impulse  response  (FIR)  filter,  represented  by  Eq.  (8) ,  whose  transfer 


function  has  zeroes  that  coincide  with  the  system  poles,  then  its  output 
will  be  zero  after  enough  time  has  elapsed  to  fill  the  delay  line  with  sig¬ 
nal.  But  the  geometric  interpretation  is  that  the  vector  0,  which  is  an 
invariant  parameter  of  the  system  and  unique  to  within  a  scalar  multiple,  is 
orthogonal  to  any  post-excitation  output  sequence  of  length  K'  regardless 
of  the  system  input  stimulus  and  sensor  location.  A  variety  of  such  subse¬ 
quences  can  be  extracted  from  a  single  long  output  sequence.  But  if  the 
system  cam  be  repeatedly  excited  in  a  variety  of  ways,  if  the  sensor  loca¬ 
tion  can  be  varied,  or  if  multiple  sensors  cam  be  used  to  record  response 
data,  then  an  enormous  amount  of  information  can  be  gotten  on  what  0  is 
orthogonal  to.  This  should  enable  one  to  accurately  determine  the  direction 
of  0,  which  is  all  that  matters  since  its  magnitude  is  arbitrary. 

Unfortunately  the  problem  is  confounded  by  the  presence  of  noise  in 

the  recorded  data  or  in  numerical  computations.  The  0  vector  must  be  deter- 

.  T 

mined  very  precisely  to  ensure  that  the  roots  of  0  Z  accurately  estimate 

the  system  poles.  The  selection  of  T  relative  to  a  given  s-pole's  frequency 

and  decay  time  can  greatly  influence  the  results.  If  T  is  too  short  then 

all  of  the  elements  of  an  output  subsequence  of  length  K'  have  about  the 

same  numeric  value,  so  that  every  subsequence  "points"  approximately  in  the 

T 

direction  represented  by  the  single  vector  (1,1,..., 1)  .  Being  repeatedly 

told  that  0  is  orthogonal  to  this  vector  is  not  much  help.  Consider  the 

case  of  a  system  whose  natural  reverberant  response  is  a  simple  undamped 

(or  very  lightly  damped)  oscillation  at  an  unknown  frequency  o>  ,  so  that 

o 

there  are  two  s-poles,  s,  _  *  ±jw  ,  so  z.  „  =  exp(±jo>  T) .  Then  the  pole 

1 , 2  O  1,2  O 

2  T  T 

polynomial  is  (z  -  2cos(w  T)z  +  1)  so  0  =  (1,  -2cos(co  T)  ,  1)  .  Since 

o  o 

this  vector  is  orthogonal  to  every  subsequence  of  length  3  we  have 

y(n)  -  2cos  (u>  T)y(n+1)  +  y(n+2)  =  0,  so  u>  can  be  determined  from 
O  '  o 


a)  =  T  ''‘arc  cos[  (y(n)+y(n+2)  )/2y(n+l)]  .  Clearly  if  T  is  so  short  that 
o 

y(n)=y (n+l)=y(n+2)  then  the  accuracy  would  be  very  poor.  A  crude  sensitivity 

analysis  of  this  formula  suggests  that  can  most  accurately  be  determined 

when  the  argument  of  the  arc  cosine  is  near  zero,  which  implies  T=i-(2ir/u)o) 

or  some  odd  multiple  thereof.  This  leads  to  the  conjecture  that,  even  for 

systems  with  many  poles,  z^  might  be  most  accurately  estimated  if  T  is  one 

half  the  oscillation  period  of  that  pole,  provided  the  decay  time  of  the 

pole  is  long  enough  that  it  rings  loudly  throughout  the  sampling  process. 

In  any  case  some  control  over  the  sampling  rate  is  obviously  desirable,  but 

this  may  be  achieved  simply  by  using  decimation  subsequences,  assuming  that 

T  is  not  too  long  an  interval  to  begin  with.  Decimation  is  equivalent  to 

multiplying  T  by  an  integer  q  (the  decimation  "epoch")  and  shifting  t  by 

o 

some  integral  multiple  of  T,  and  leads  directly  to  a  modified  version  of 
Eq.  (8)  : 

tot 

(y(n),  y(n+q)  ,  y(n+2q),  ..  ,  y(n+Kq)Pifi  =  0  (9) 

A  T 

where  <p  it(4i  ty)  is  the  vector  of  coefficients  for  a  polynomial 

0  1  K 

T  g 

ip  Z  whose  roots  are  {z^}.  Thus  subsequences  of  length  K'  whose  elements  are 
spaced  q  samples  apart  are  orthogonal  to  ip,  and  one  can  determine  the  system 
poles  from  \p  provided  the  decimation  aliasing  ambiguity  can  be  resolved. 

If  T  was  rather  short  to  begin  with,  then  these  decimation  subsequences  may 
point  in  much  more  varied  and  useful  directions  than  if  the  sequence  had  not 
been  decimated.  Note  that  in  decimating  the  data  one  does  not  need  to  dis¬ 
card  anything,  since  many  subsequences  can  be  staggered  along  the  original 
data  sequence.  In  virtually  all  that  follows,  the  techniques  proposed  for 
estimating  9  can  also  be  used  to  estimate  p  provided  one  uses  decimation 
subsequences. 
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The  pole  identification  problem  has  been  addressed  by  many  investigators, 
and  the  number  of  references  to  it  in  the  literature  is  almost  unbounded. 
Entire  areas  of  control  theory  and  speech  processing  have  been  devoted  to  it, 
but  often  with  the  following  differences:  (1)  The  system  is  assumed  to  be 
persistently  excited,  perhaps  by  a  pulse  train  or  by  noise,  (2)  The  system 
input  is  measured  and  incorporated  into  the  analysis.  However,  much  of  the 
literature  does  pertain  to  our  problem.  Nevertheless,  we  shall  break  with 
tradition  by  presenting  our  results  first  and  then  discussing  their  relation 
to  those  of  other  researchers. 

Finding  0  when  shown  what  it  is  orthogonal  to  is  equivalent  to  solving 
a  homogeneous  matrix  equation  Ax=0,  where  the  rows  of  the  A  matrix  consist 
of  output  data  subsequences.  Because  the  homogeneous  problem  is  lightly 
treated  in  most  textbooks,  a  section  of  our  paper  summarizes  some  of  the  rel¬ 
evant  techniques.  This  is  followed  by  some  autoregression  matrix  terminology, 
theorems,  and  algorithms  developed  by  the  author  especially  for  this  applica¬ 
tion,  although  their  simplicity  suggests  that  they  may  have  been  discovered 
in  some  form  by  mathematicians  long  ago.  With  this  preparation,  various 
approaches  to  solving  the  central  problem  are  discussed.  Connections  with 
the  work  of  Jain  [7]  are  considered  and  a  simpler  and  more  elucidating  path¬ 
way  to  his  results  is  found,  providing  a  generalization  of  the  method.  Some 
topics  from  the  literature  are  discussed  and  several  other  results  are  pre¬ 
sented  before  posing  unanswered  questions. 

Our  treatment  departs  from  most  of  the  literature  in  that  it  stresses 
the  use  of  matrix  methods  and  vector  space  geometry  rather  than  "sequential" 
concepts,  primarily  because  we  permit  the  use  of  several  output  sequences 
derived  from  possibly  different  input  excitations  or  sensors,  and  it  strives 
to  avoid  ad  hoc  and  asymmetrical  treatments  of  the  data  that  have  been  used 


I 
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in  the  past.  In  particular  we  avoid  the  artificial  conversion  of  what  is 
fundamentally  a  homogeneous  problem  into  an  inhomogeneous  one.  We  present 
for  the  first  time  a  rigorous  theoretical  foundation  for  the  popular  prac¬ 
tice  of  using  "extra-wide"  data  matrices,  i.e.  of  pretending  there  are  more 
poles  than  actually  exist. 

SOLVING  THE  HOMOGENEOUS  MATRIX  EQUATION 

For  an  MxN  matrix  A  define  the  nullspace,  $  (A)  ,  and  rowspace,  iJi^-(A)  ,  as 

T  i  N 

{x:  Ax  =  0}  and  {x:  x  =  A  y}.  If  the  rank  of  A  is  K,  then  <|i-l(A)CR  and 

,  t  M  N 

A  )C  R  ,  both  being  subspaces  of  dimension  K.  Also  <{>  (A)CR  ,  and  it  is  a 

subspace  of  dimension  N-K.  Indeed  $  (A)  and  <fr^(A)  are  orthogonal  complements 
N  N 

of  R  ,  so  that  any  xeR  can  be  expressed  as  an  orthogonal  sum,  x  =  xft  +  x-  , 
where  x  is  the  projection  onto  the  rowspace  of  A  and  x-  is  the  projection 

A  A 

onto  the  nullspace.  Given  a  matrix  E  of  the  same  shape  as  A  we  shall  denote 
by  E  and  E-  the  matrices  obtained  by  decomposing  the  rows  of  E  similarly; 

A  A 

thus  E  =  EA  +  E-,  A^  =  A,  and  A^  =  0.  Moreover,  occasionally  we  shall  use 

.  .  T 

the  notation  " [xl  .  "  to  denote  the  unit  vector  in  the  direction  of  x.  A  A 
unit 

T  T 

and  AA  also  have  rank  K  and  <t>  (A  A)  =  <p  (A) .  The  homogeneous  equation  Ax=0 

T 

is  thus  solved  by  any  x  m  $ (A  A) .  If  the  rank  of  A  is  N-l,  then  the  solu¬ 
tion  is  unique  to  within  a  scalar  multiple,  since  4>  (A)  is  of  unit  dimension. 
If  rank (A)  is  N,  then  no  solution  exists. 

If  A  is  a  square  NxN  matrix,  then  Ax  adj (A)  =  det(A)*I,  recalling  that 
adj (A)  is  the  matrix  obtained  by  replacing  each  element  of  A  by  its  cofactor 
and  then  transposing.  In  particular  if  rank (A)  is  N  then  the  matrix  is  non¬ 
singular  and  the  inverse  can  be  defined  as  A  ^  =  adj (A) /det (A) .  However  if 
rank(A)  <N,  i.e.  A  is  singular,  then  det(A)  =  0  so  that  A*  adj (A)  =  0, 


implying  that  every  column  of  adj (A)  solves  the  homogeneous  equation  Ax=0. 


m 
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Unfortunately  adj (A)  is  identically  zero  whenever  rank(A)  <  N-l,  so  a  non¬ 
trivial  solution  is  provided  only  if  rank (A)  is  N-l,  in  which  case  all  the 
columns  of  adj (A)  are  collinear  and  at  least  one  is  non-zero.  Thus  the 
solution  to  Ax=0  can  be  taken  as  any  non-trivial  column  of  adj (A)  or  alter¬ 
natively  as  x  =  adj (A)  x  xq  for  any  Nxl  vector  x^  so  long  as  it  is  chosen 
to  avoid  a  trivial  solution. 

Given  an  MxN  matrix  of  rank  K,  the  traditional  approach  to  solving  Ax=0 
is  to  proceed  as  follows:  First  discard  rows  of  A  so  as  to  form  a  rank¬ 
preserving  KXN  matrix  B;  since  tp  (A)  =  p  (B)  ,  it  follows  that  Bx=0  has  the 
same  solutions.  Now  delete  columns  of  B  to  leave  a  K*K  nonsingular  matrix  B  , 
and  move  the  deleted  columns  to  the  right  side  of  the  equation  together  with 
the  corresponding  elements  of  x  to  produce  the  inhomogeneous  equation 

B,x,  =  -B  x  ,  where  B  is  K* (N-K)  and  x  has  been  separated  into  the  Kxl  and 
1  1  o  o  o 

(N-K)xl  vectors  x,  and  x  .  If,  as  is  usually  the  case,  the  rank  of  A  can  be 
1  o 

preserved  by  simply  deleting  its  rightmost  columns,  then  one  can  simply  par- 

T  T  T 

tition  B  =  [B, ;B  ],  and  x  =  [x,  ;x  ]  to  achieve  the  desired  result.  Now 
1  o  l  o 

x,  =  -B,  "'"B  x  so  that  the  general  solution  is 
1  1  o  o 


x  = 


0 


x 


0 


(10) 


where  I  is  (N-K)  x  (N-K)  and  xq  is  an  arbitrary  (N-K)  x  l  vector,  thus  giv¬ 
ing  a  solution  subspace  of  dimension  N-K.  If  something  more  complicated 
than  a  simple  partitioning  is  used  to  form  the  matrices  B^  and  Bq,  then  Eq. 
(10)  is  still  valid  with  the  rows  of  C  interchanged  appropriately. 

This  traditional  method  is  fine  if  the  data  (i.e.  A)  are  noiseless  and 


computations  are  exact.  Otherwise  it  is  worthwhile  to  search  for  better 
alternatives.  In  particular  if  M  >>  K  then  it  is  wasteful  to  discard  so 
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many  rows  to  get  B,  since  each  row  provides  some  information  on  what  0  is 

T 

orthogonal  to.  It  would  be  better  to  solve  Px=0,  where  P  =  A  A  so  that 
$  (P)  =  <f>  (A) ;  then  only  a  few  rows  of  P  will  have  to  be  discarded  to  get  B, 
and  every  element  of  A  contributes  something  to  the  solution.  (An  alterna¬ 
tive  would  be  to  fabricate  rows  of  B  as  averages  of  those  of  A,  perhaps  aver¬ 
aging  out  some  of  the  noise  in  the  process.  But  this  could  as  easily  average 
out  some  of  the  signal,  and  it  would  be  hard  to  guarantee  preservation  of 
rank.  This  alternative  will  not  be  explored  further.)  Typically  there  are 
many  rank-preserving  ways  to  discard  rows  of  P,  and  to  choose  which  columns 
to  transfer  to  the  other  side  of  the  equation.  These  choices  cure  arbitrary 
and  may  influence  the  accuracy  of  the  result;  indeed  they  control  the  degree 
to  which  each  of  the  elements  of  A  contribute  to  the  final  result.  This 
arbitrary  and  asymmetrical  use  of  data  is  bothersome  and  might  produce  statis 
tical  bias. 

In  the  special  case  where  rank  (A)  is  N-l,  the  adjoint  solution  can  post 

T 

pone  or  remove  the  arbitrary  asymmetry.  The  direct  solution  to  A  Ax=0  is 
T 

just  x  =  ad j (A  A)x  ,  where  x  is  an  arbitrary  Nxl  vector.  Even  the  arbitrar- 
o  o 

.  .  A  T 

mess  of  xq  can  be  symmetrically  removed.  Since  Q=  ad j (A  A)  has  collinear 

2  th 

columns  and  is  symmetric,  its  elements  obey  q„  =  q^q^  and  the  j  column 

can  be  expressed  as  |qj;.|  x  [  (±)  (±)  2  ^  ^22 1  ^ '  ***  ' 

where  (±)^  *  sign  (q^).  Thus  the  solution  can  be  expressed  formally  as 
T  T 

x  =  [A,,  A„,  ...  ,  A  1  ,  where  A.  is  the  square  root  of  the  magnitude  of 
l  2  N  l 

T 

the  itn  diagonal  cofactor  of  A  A,  with  the  proper  sign  affixed.  The  signs 

T 

can  be  determined  by  examining  any  column  of  adj (A  A) . 

The  adjoint  solution  enables  the  development  of  an  unbiased  estimate, 
based  on  the  following  theorem: 


S  1  THEOREM:  Let  A^ ,  Aj  be  statistically  independent,  random  matrices, 
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T 

whose  elements  are  also  statistically  independent.  Then  £{adj(A^A2)}  = 

adj (£{a^}£{A2}) ,  where  g  is  the  expectation  operator.  O 

Thus  if  A^  and  A2  are  separate  measurements  of  the  same  A  corrupted  by 

additive,  zero-mean  noise  that  is  independent  from  element  to  element,  then 
-  T 

the  estimate  x  =  adj(A_A„)x  is  an  unbiased  estimate  of  a  true  solution  x. 

12  O 

The  singular  value  decomposition  (SVD)  can  always  be  used  to  obtain  a 

solution;  indeed  its  applicability  overlaps  that  of  the  adjoint  solution. 

T  T 

The  SVD  of  an  M*N  matrix  A  is  expressed  by  U  AV=S  and  A=USV  where  S  is  an 

M*N  diagonal  matrix  of  elements  s^  >_  >_  . . .  >_  sK+^  =  sK+2  = _ =  s^  =  0 , 

where  rank  (A)  is  K.  The  "singular  values"  s^  are  the  square  roots  of 

T 

eigenvalues  of  the  non-negative  definite  matrix  A  A,  whose  eigenvectors  also 

T 

form  the  columns  of  the  N*N  orthogonal  matrix  V.  The  eigenvectors  of  AA 

T  T 

form  U.  (Incidentally  A  A  and  AA  agree  as  to  their  non-zero  eigenvalues.) 

till 

The  notation  sv^(A)  will  denote  the  i  singular  value  of  any  matrix  A. 

The  singular  values  can  be  used  to  bound  | |Ax| |  for  any  x  as  follows  [8]: 

| |x| |  x  sv1(A)  >  | |ax| |  >  | ]x| |  x  svN(A)  and  | |AxJ |  >  | |xft| |  x  svR(A) 

where  rank  (A)  is  K.  Clearly  | )a| |  =  sv^ (A) .  (Note:  | |x| |  denotes  the 

Euclidean  norm  of  the  vector,  and  | |a| |  is  defined  as  the  maximum  of  | |ax| | 

for  all  unit  vectors  x.)  The  U  and  V  matrices  are  not  quite  unique;  the 

direction  of  any  column  vector  can  be  reversed,  and  corresponding  to  a 

multiple  eigenvalue  (including  the  zero  eigenvalues)  any  orthonormal  basis 

for  the  eigenspace  can  be  used.  The  homogeneous  equation  Ax=0  now  takes 
T  A  T 

the  form  USV  x*0  or  simply  Sy»0  where  y  s  V  x,  and  we  have  premultiplied 
T  -1 

the  equation  by  U  *U  .  But  due  to  the  diagonal  form  of  S,  the  solution  is 

T  T 

immediate:  y  *  [0,0, ..0;y  ]  where  y  is  an  arbitrary  (N-K)xi  vector. 

o  o 

Thus  x  =  Vy  *  V  y  ,  where  V  consists  of  the  rightmost  N-K  columns  of  V. 
o  o  o 


But  these  columns  are  just  the  basis  vectors  for  the  eigenspace  of  the  zero 
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eigenvalue  of  A  A,  i.e.  basis  vectors  for  <p  (A)  ,  so  it  seems  we  have  only 
reiterated  that  Ax=0  implies  xe4>  (A) .  But  the  superiority  of  the  SVD  method 
emanates  from  the  fact  that,  due  to  noise  or  computing  inaccuracy,  A  A  will 
not  be  exactly  singular;  the  SVD  method  will  reveal  the  extremely  small 
eigenvalues  that  arise,  and  the  associated  columns  of  V  should  approximately 
span  $ (A) .  Furthermore  the  SVD  can  be  achieved  by  a  very  orderly  procedure. 

An  extremely  well  documented  algorithm  and  FORTRAN  program  appear  in  the 
textbook  of  Lawson  and  Hanson  [9]  under  the  label  SVDRS.  (Note:  In  their 
notation  one  should  set  BA=0  so  that  array  B  is  not  referenced.) 

when  the  diagonal  matrix  S  resulting  from  the  SVD  algorithm  is  replaced 
by  S,  wherein  all  but  the  first  k  elements  (i.e.  the  k  largest)  have  been 
forced  to  zero,  and  then  used  to  compute  A=USV  ,  the  result  is  an  optimum 
approximant  to  A  of  rank  k,  denoted  A(k;SVD).  It  has  been  shown  [9]  to  be 
closest  as  measured  by  either  the  ordinary  matrix  norm,  [ | A— A [ | ,  or  the 
Frobenius  norm,  ||a-a|  |p,  where  |  |a|  |p  k  (ZU^)2)*5.  Thus  if  A  has  been 
corrupted  by  noise  then  a  logical  estimate  of  the  nullspace  is  $ (a)  =  <j>  (A(K;SVD) ) , 
where  K  is  the  "true"  rank  A  would  have  if  it  were  uncorrupted  by  noise. 

This  estimate  of  the  nullspace  thus  provides  a  general  solution  to  Ax=0. 

Using  the  approximant  to  solve  nonhomogeneous  equations  is  a  common  practice. 

The  justification  for  our  approach  to  solving  the  homogeneous  equation  is 
strengthened  by  the  following  theorem. 

§2  THEOREM:  If  A=A+E ,  where  nothing  is  known  of  the  M*N  matrix  A  except  that 
it  is  of  rank  K,  and  the  elements  of  E  are  zero-mean,  independent,  Gaussian 
random  variables  having  equal  variance,  then  A(K;SVD)  and  its  nullspace  are 
maximum  likelihood  estimates  of  A  and  (A) ,  respectively.  Moreover,  if  it 
is  known  only  that  A  is  singular,  with  rank (A)  unknown,  then  the  maximum 
likelihood  estimate  is  obtained  with  K-N-l.  If  on  the  other  hand  K  is 


random  with  known  probability  distribution  PK»  then  choose  K  to  minimize 

| | A-A(K;SVD) | | 2  -  2a2  log  P  ,  where  a2  is  the  noise  variance.  CD 
F  K 

If  only  a  single  solution  to  Ax=0  is  desired  rather  than  the  general 

(nullspace)  solution,  then  one  can  simply  point  x  in  the  direction  of  the 

rightmost  column  vector  of  V,  since  it  is  the  basis  vector  of  $ (A)  that  has 

possibly  been  perturbed  the  least.  But  that  is  the  same  as  picking  x  to  be 

-T~ 

the  eigenvector  corresponding  to  the  smallest  eigenvalue  of  A  A;  i.e.  picking 
x  to  minimize  (xTATAx)/| |x| | 2  or  | |ax| |/| |x| | .  When  normalized  to  | |x| |=1 
we  shall  denote  this  estimate  by  x[norm],  where  "norm"  stands  for  both  "normal¬ 
ized"  and  "norm-minimizing" ,  but  it  is  only  unique  to  within  a  direction 

reversal.  Since  the  addition  of  noise  can  be  expected  to  destroy  all  singu- 

~T~ 

larity  of  A,  and  even  eliminate  any  multiple  eigenvalues,  then  A  A  will  be 

-T~  -1 

nonsingular  and  its  smallest  eigenvalue  is  the  largest  of  (A  A)  .  Matrix 

-T~  “I  A 

iteration,  x^+1  *  [ (A  A)  xilvmit'  converge  to  xfnorm].  However  since 

~T~ 

A  A  is  nonsingular  only  because  of  noise,  it  may  be  poorly  conditioned  and 
difficult  to  invert  accurately,  so  the  method  must  be  used  cautiously.  Of 

course  direct  matrix  inversion  can  be  avoided  by  numerically  solving  the 

-In¬ 
equation  A  Ax^+j=  x^  with  subsequent  normalization  for  each  iteration.  It  xs 

-T-  -1  -T- 

interesting  to  note  that  since  (A  A)  =  A* adj (A  A) ,  where  the  scalar  A  is 

just  the  determinant,  the  adjoint  solution  discussed  previously  may  be 

* 

regarded  as  one  step  in  the  matrix  iteration  process  (except  for  the  trivial¬ 
ities  of  normalization  and  possible  direction  reversal)  from  an  arbitrary 

starting  point  x^.  Indeed  an  "improved"  adjoint  solution  may  be  put  forth 
-T-  l 

as  [adj (A  A)]  xQ  for  any  integer  l.  In  the  absence  of  noise  it  will  still 

* 

A  connection  between  the  adjoint  solution  and  the  traditional  least- 
squares  solution  of  the  inhomogeneous  equation  will  be  given  in  a  later 
section. 
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give  the  same  solution  when  rank (A)  is  N-l,  and  in  the  presence  of  noise  it 
tends  toward  the  direction  of  x[norm] .  Moreover  by  using  several  separate 
measurements  of  A  in  conformance  with  Theorem  §1  one  can  improve  the  unbiased 
estimate  in  a  similar  manner. 

When  A  is  corrupted  by  noise  one  expects  that  x  will  depart  from  <J>  (A)  . 

The  following  theorem  provides  an  upper  bound  for  that  error. 

§  3  THEOREM:  If  the  unit  vector  x  is  an  estimated  solution  to  Ax=0  constructed 
to  be  within  the  nullspace  of  A(K;SVD)  where  A=A+E  and  rank (A)  =  K,  then  the 
error  x  is  bounded  according  to  | |x  | [<b  ,b  ,b  ,b  ;  and  if  | | E | | «sv  (A) 
then  the  denominator  of  each  bound  is  approximately  sv  (A) ,  where  b  =  {| |ax| | 

+  | |E| |}/svr(A)  ;  b2  =  2  ||e||/svk(A)  ;  b3  =  {|[a£||  +  I lEAl I >/sv^(A+E^)  ; 
b.  =  2 1  I E—  I  l/svfA+El .  Moreover  b.  <  b_  <  2b,  and  b,  <  b.  <  2b. .  P] 

Discussion:  Note  that  the  theorem  includes  x  =  x[norm]  as  a  special 
case,  x-  is  the  "correct"  portion  of  x  (i.e.  the  component  actually  in  <p  (A) ) 

A 

and  x  is  the  error.  Since  x  is  a  unit  vector,  | |x  | |  is  the  sine  of  the 

A  A 

a 

angle  between  x  and  <j>  (A) .  (The  angle  between  two  unit  vectors  is  the  arc 
cosine  of  their  inner  product;  the  angle  between  a  vector  and  a  subspace  is 
then  defined  as  the  smallest  such  angle,  always  taken  positive.)  Note  that 
| | E | |  is  the  square  root  of  the  largest  eigenvalue  of  Jk,  with  a  similar  result 
holding  for  E-.  Also  (svK(A))is  the  smallest  non-zero  eigenvalue  of  >. 

Clearly  if  |  |e|  |«sv  (A)  then  b  <<1  so  the  error  component  is  small,  but  the 
b4  bound  shows  that  error  occurs  only  when  some  portion  of  E  "complies"  with 
the  rowspace  of  A.  Observation  of  a  single  noisy  matrix  A  will  tell  us 
little  of  the  error,  E;  thus  knowledge  of  | |e| |  or  MEAI I  roust  be  obtained 
a  priori,  e.g.  in  statistical  form.  The  singular  values  of  A  can  be  used 
a  posteriori  to  approximate  the  denominators  of  the  bound  s.  Since  bounds 
b2  and  b^  are  almost  as  tight  as  b^  and  b^,  use  of  [ |ax| |  a  posteriori  does 
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little  to  refine  our  error  estimate.  The  distinction  between  | |ax| |  and 
| |ac  | |  is  important:  The  "residual  error",  i.e.  | |ax| |  =  sv^ (A) ,  measures 
the  degree  to  which  the  noisy  matrix  A  refuses  to  admit  a  homogeneous  solu¬ 
tion.  On  the  other  hand  | |xA| |  measures  the  actual  error  in  the  estimate  x. 
Although  scaling  a  row  of  A  does  not  affect  the  solution  of  the  ideal  equa¬ 
tion  Ax=0,  scaling  of  a  row  of  A  may  influence  the  error  in  x[norm]  since  it 

can  affect  sv  (A) . 

K 

A  —A 

Even  when  x  is  truly  a  solution  of  Ax=0 ,  the  "residual"  vector  Ax  cannot 
be  expected  to  be  a  null  vector.  In  some  problems  the  statistics  of  the 
residual  vector  may  he  known,  at  least  approximately.  Indeed  suppose  that 
it  is  zero-mean  with  positive  definite  covariance  matrix  A.  Then  instead  of 
choosing  x  to  minimize  | |ax| |  it  is  more  equitable  to  minimize  the  weighted  norm 

| | Ax  I [  -  =  (x  A  A  Ax)  .  We  shall  denote  by  x [norm: A  ]  the  unit  vector  that 

A  ,t  -i- 

minimizes  this  norm.  It  is  the  weakest  eigenvector  of  A  A  a.  The  SVD  method 

can  also  be  modified  to  incorporate  this  covariance  weighting.  One  simply 

_i»_  _  .  A  -h 

does  the  SVD  of  A  A  instead  of  A.  The  matrices  A  and  A  =  A  A  have  the 

w 

same  nullspace  and  A  (k,-SVD)  is  the  best  k-rank  approximant  to  A  ,  minimizing 

w  w 

|  |  A  —A  ||  =  |  |  A-A+1,A  ||  .  So  A 15 A  is  the  best  weighted  approximant  to  A. 

W  W  W  1  “A  W 

^  jj* 

Therefore  the  corresponding  weighted  estimate  of  $  (A)  is  <f>  (A  A^)  =  <f>  (Aw) . 

All  of  the  results  given  in  this  section  hold  when  the  weighted  norm  is  sub¬ 
stituted  for  the  ordinary  norm,  with  appropriate  interpretations  and  adjust¬ 
ments. 

AUTOREGRESSION  MATRICES,  GENERATORS,  AND  SUBSPACES 

§  4  DEFINITIONS:  For  any  K'*l  vector  9,  the  "polynomial  produced  by  9"  will 
T 

refer  to  9  Z,  where  Z  is  the  complex  power  vector  defined  previously.  Any 


vector  0  is  said  to  be  a  "generator"  if  its  first  and  last  elements  are  both 
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T 

non-zero  and  the  roots  of  the  "generator  polynomial"  0  Z  are  distinct.  Gen¬ 
erators  that  differ  by  a  scalar  multiple  are  considered  equivalent.  (Note: 
the  interpretation  of  the  elements  of  0  as  coefficients  of  the  system  pole 
polynomial  will  be  ignored  until  the  next  section.)  Given  any  generator  0 
and  a  positive  integer  l  define  the  £*N  matrix  G (£;0),  where  N  =  K'-l+Jl  = 
K+£,  as: 


G(£;0) 


0_,0, - - ,0V,O,O,...  ,0 

U  X  lv 

O,0Q,01, - ,0k'°' - '° 


(ID 


0,0,., 


•0'60-ei’---V 


til 

Now  define  the  "£  autoregression  nullspace  generated  by  0",  denoted 

ft^O),  as  the  rowspace  of  G.  Then  it  is  easy  to  prove  the  following  theorems 

N 

§  5  THEOREM:  Rank(G)  is  £,  so  that  fi^O)  is  a  subspace  of  R  having  dimen¬ 
sion  £,  where  N«K+£ .  Moreover  the  generator  of  the  subspace  ft^  is  unique  to 
within  a  scalar  multiple.  d 

§  6  THEOREM:  For  every  Nxl  vector  x  in  ft  (0),  the  roots  of  the  polynomial 

T  T 

x  Z  constitute  a  superset  of  the  roots  of  the  generator  polynomial  0  Z.  The 

"extraneous  roots"  are  2,-1  in  number,  and  any  complex  extraneous  roots  must 

occur  in  complex  conjugate  pairs.  Indeed  an  x  in  ft^  can  be  found  to  produce 

any  specified  set  of  extraneous  roots.  d 

N 

I  7  THEOREM:  The  subspace  ft^(0)  can  also  be  defined  as  that  subspace  of  R 
which  is  orthogonal  to  each  of  the  complex  power  vectors  Z^,  of  the  K  roots 
of  the  generator  polynomial  z^,  with  N=K+£ .  d 

§  8  DEFINITION:  A  sequence  {y(n)}  is  said  to  be  "9-autoregressive"  (0-AR) , 
where  9  is  K'*l,  if  every  subsequence  of  length  K'  forms  a  vector  orthogonal 
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to  0.  This  implies  that  for  n  >_  K,  y(n)  =  0^  x  (0K_^y(n-l)  +  0R_2y(n-2)  + 

....  +  0Qy(n-K)).  Thus  the  first  K  members  of  the  sequence  are  arbitrary  and 
will  be  termed  the  "seed",  while  the  remaining  members  are  determined  by  a 
recursion  formula.  Clearly  if  any  leading  group  of  members  is  deleted,  the 
remaining  sequence  is  still  0-AR. 

§  9  THEOREM:  The  elements  of  an  N*1  vector  y  form  a  0-AR  sequence  if  and 
only  if  y±fi^(0)  where  l  =  N-K  =  N-(K’-l)  and  0  is  K'xl.  C 
§  10  DEFINITION:  An  M*N  matrix  A  is  said  to  be  0-AR  if  each  row  constitutes 
a  0-AR  sequence.  Clearly  the  definition  makes  no  sense  unless  N>K* .  The 
matrix  is  termed  "minimal-width”  if  N=K*  and  "extra-wide"  if  N>K' .  If  rank (A) 
is  K(— K'— 1) ,  which  according  to  the  next  therem  is  the  most  it  can  be,  then 
the  0-AR  matrix  A  is  said  to  be  of  "sufficient  rank". 

§  11  THEOREM:  If  the  M*N  matrix  A  is  0-AR  where  0  is  K'xi,  then  rank(A)  £ 
K=K'-1.  Further,  if  A  is  of  sufficient  rank  then  <f>  (A)  5  (2^(0)  where  i.=N-K, 
and  any  solution  of  Ax=0  produces  a  polynomial  of  degree  N-l  whose  roots  con¬ 
stitute  a  superset  of  the  roots  of  the  generator  polynomial.  D 
5  12  MORE  DEFINITIONS  AND  DISCUSSION:  Clearly  a  0-AR  matrix  is  the  continua¬ 
tion  eastward  of  its  "seed  submatrix"  consisting  of  the  leftmost  K  columns. 

.  T 

If  it  happens  that  A  is  0-AR  as  well,  then  the  seed  submatrix  may  instead 
be  regarded  as  the  Rxk  matrix  in  the  northwest  corner,  since  from  it  the  entire 
matrix  can  be  propagated  autoregressively.  A  sufficient  (but  not  necessary) 
condition  that  a  square  seed  matrix  can  propagate  such  a  matrix  is  that  the 
seed  be  symmetric.  A  special  case  is  the  "Hankel"  0-AR  matrix,  so  named 
because  the  Mxn  matrix  A  is  formed  from  a  single  0-AR  sequence  {h(n)}  of 
length  M+N+l  as  A^  *  h(i+j-2).  The  Hankel  matrix  has  an  interesting  prop¬ 
erty:  For  any  Nxl  vector  x,  Ax  *  G(M;x)h  where  h  is  the  vector  whose  elements 
constitute  {h(n)}.  This  property  is  just  a  consequence  of  the  fact  that 
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discrete  convolution  is  commutative.  The  numbering  pattern  for  a  8x6  Hankel 

matrix  is  shown  in  Figure  1.  For  this  case  there  is  no  need  to  define  the 

kernel  as  a  submatrix:  the  kernel  simply  consists  of  the  first  K  matrix 

elements  encountered  along  either  edge  starting  from  the  northwest  corner,  and 

the  entire  matrix  is  propogated  from  it  via  the  0-AR  property.  Any  submatrix 

T 

of  contiguous  rows  and  columns  remains  a  Hankel  0-AR  matrix,  and  A  and  A  are 
both  0-AR. 

The  central  problem  considered  in  this  paper  may  be  stated  succinctly  as 
follows:  Given  a  0-AR  matrix  A  or  a  noisy  version  thereof,  A,  where  the  K'xi 
generator  0  is  unknown,  find  the  roots  of  the  generator  polynomial.  Finding 
0  is  normally  an  intermediate  step,  albeit  one  we  would  prefer  to  bypass  since 
the  roots  may  be  poorly  conditioned  with  respect  to  the  polynomial  coefficients. 
Various  paths  to  the  solution  are  possible,  given  a  matrix  A  of  sufficient 
rank  and  at  least  minimal-width.  From  Theorem  §11  <J>  (A)  =  n^(0)  ,  and  any  x 
solving  Ax=0  produces  a  polynomial  whose  roots  include  the  desired  roots  of 
the  generator  polynomial.  If  the  extraneous  roots  can  somehow  be  identified 
then  the  problem  is  solved.  If  instead  we  find  l  linearly  independent  solu¬ 
tions  of  Ax=0,  i.e.  a  basis  for  fi  (9) ,  then  9  can  be  determined  by  a  process 

X* 

described  below  without  having  to  identify  the  extraneous  poles.  If  a  noisy 
matrix  A  is  used  then  the  method  used  to  solve  Ax=0  becomes  significant.  The 
estimate  x[norm]  can  be  found  as  the  weakest  eigenvector  of  A  A  and  used  to 
produce  a  polynomial  whose  extraneous  roots  may  now  be  regarded  as  "noise 
poles".  The  more  traditional  methods  of  solving  the  equation  may  also  be  used; 
they  may  entail  less  computation  but  perhaps  more  error.  If  the  A  matrix  is 
of  minimal  width  then  any  solution  of  Ax*0  is  collinear  with  9  and  there  are 
no  extraneous  roots.  However  the  roots  may  be  perturbed  greatly  by  noise  in  A.. 

Suppose  that  a  set  of  l  basis  vectors  g^,  g 2,  ..  g„  has  been  found  for 


I 
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$  (A)  =  Q^(0);  i.e.  the  g^'s  are  linearly  independent  solutions  of  Ax=0.  In 
practice  the  g^'s  may  be  obtained  as  the  rightmost  columns  of  the  V  matrix  in 
the  SVD  expansion  of  A,  or  by  some  cruder  technique.  For  example  if  some 
quick  method  is  used  to  solve  the  noisy  equation  Ax=0  then  repetition  of  the 
solution  after  tinkering  with  A  (i.e.  inserting  new  data,  scaling  rows,  etc.) 
would  likely  result  in  linearly  independent  solutions.  Regardless  of  how 
obtained,  the  solution  vectors  g^  can  be  used  to  find  0  through  the  following 
algorithm: 

T  T 

§  13  ALGORITHM:  First  define  as  the  £xN  matrix  whose  rows  are  g  ,  g 2, 

T 

...»  g^,  so  that  the  rowspace  of  G^  is  <ji  (A)  =  (2^(0),  thus  identical  to  the 
rowspace  of  G(£;0)  defined  in  Eq.  (11).  Then  starting  with  the  topmost  row  use 
Gaussian  elimination  to  form  the  upper-trapezoidal  matrix  G 2  as  exemplified 
in  Figure  2 (b) .  Then  repeat  the  process  starting  with  the  bottom  row  and  elim¬ 
inating  terms  on  the  right  to  form  the  upper-parallelogramic  matrix  G^  exempli¬ 
fied  in  Figure  2(c).  Assuming  the  rank  of  G^  has  been  preserved  the  rowspace 

of  G3  is  the  same  as  that  of  G(£;0),  so  the  non-zero  portion  of  each  row  of 
.  T 

G^  is  a  replica  of  0  to  within  a  scalar  multiple  (If  a  noisy  matrix  A  is 

used  then  the  rows  are  only  estimates  of  0.)  Now  define  G 4  as  the  £xk'  matrix 

formed  by  "straightening  out"  the  non-zero  paralellogram  of  G ■  and  discarding 

the  zeroes.  In  the  absence  of  noise  or  computational  error  the  rows  are 

collinear  and  rank(G4)  is  1.  If  noise  is  present  then  each  row  is  an  estimate 
T  T 

of  0  .  Logically  a  "best"  estimate  of  0  can  be  obtained  by  using  the  SVD  to 

find  the  best  unit-rank  approximant  to  G^ ,  whose  collinear  rows  then  produce 

T 

identical  estimates  of  0  .  But  this  is  completely  equivalent  to  estimating 

T 

0  as  the  strongest  eigenvector  of  G4G4,  a  task  that  can  be  performed  very 
easily  with  matrix  iteration. 


The  procedure  presented  above  can  be  modified  so  as  to  constitute  a 
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stepwise  reduction  of  £2^(0)  to  £2J_1(0),  ultimately  arriving  at  £2^0)  whose 
single  vector  direction  is  0. 

§  14  THEOREM:  Let  the  N*1  vectors  g^,  _ _ ,  g^  form  a  basis  of  £2  (0)  ,  of 

which  at  least  one  must  then  have  a  non-zero  first  element;  assume  it  is  g  . 

Furthermore  assume  the  g.'s  are  scaled  so  that  the  first  element  of  each  is 

either  1  or  0.  Then  for  i=2  to  SL  define  a.  =  g.-g,  if  the  first  element  of  g. 

l  ’1  i 

is  1,  otherwise  a.  =  g..  Then  the  set  {g  ,a_,a7,..,a  }  is  still  a  basis  of 

£2^(0),  and  the  set  of  contracted  vectors  (a^,^, . . ,a^}  obtained  by  discarding 

the  first  element  of  each  a^  forms  a  basis  of  £2^  ^(0).  d] 

i  15  COROLLARY:  The  theorem  remains  valid  if  the  words  "last  element"  are 

substituted  for  "first  element"  at  every  occurrence.  LH 

N  N-l 

It  should  be  remembered  that  £2  C.  R  while  £2  C  R  .  If  in  the  process 

A*  J6“l 

of  stepwise  reduction  one  uses  Theorem  §14  repeatedly  the  final  result  will  be 
a  single  vector  equivalent  to  the  bottom  row  of  the  matrix  G obtained  with 
Algorithm  §13.  But  if  one  switches  at  some  point  to  using  the  corollary  then 
the  result  will  correspond  to  one  of  the  other  rows  of  . 

Now  we  present  a  formula  for  the  sensitivity  of  the  roots  of  the  generator 
polynomial,  when  estimated  as  the  non-extraneous  roots  of  the  polynomial  pro¬ 
duced  by  x,  where  x  is  some  approximate  solution  to  Ax=0.  Recall  that  if  x 
is  a  unit  vector  then  its  error  component  is  measured  by  | |x  | |. 

§  16  THEOREM:  Suppose  |  |x|  | -1  and  x  lies  approximately  within  <j>  (A)  ,  i.e. 

| |x  | |<<| | x— | | .  Then  the  non-extraneous  roots  of  the  polynomial  xTZ  depart 
from  the  true  roots  z^ ,  » • • <  zK  of  the  generator  polynomial  according  to 

the  first-order  formula: 


dz . 


i 


z . 

l 


-  x  Z. 
A  l 

«T 

x  DZ. 
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where  D  is  a  diagonal  matrix,  D. .=0,  D  =l/k  for  k=2  to  N.  D 

1 1  KJC 

If  in  particular  x  is  x[norm] ,  then  the  above  result  can  be  combined  with 
Theorem  §3  to  produce  an  approximate  error  bound  in  terms  of  | [ E — [ |  and  the 
non-extraneous  roots,  z^,  of  the  polynomial  it  produces.  (The  bound  is 
approximate  because  the  sensitivity  formula  is  correct  only  to  first  order, 
and  estimated  roots  are  used. ) 

- r— ^ — ; —  (12) 

sv  (A)  X  X  DZ. 

K  1  1 1 

In  the  problem  of  finding  8  given  a  noisy  version  0-AR  matrix  A,  it 

A 

might  appear  that  Theorem  §2  should  always  apply,  and  that  $  =  <J>  (A(K;SVD) )  is 
a  maximum  likelihood  estimate  of  <f  (A) .  However,  there  are  two  critical  assump¬ 
tions  in  the  hypothesis  of  Theorem  §2:  (1)  that  the  noise  corrupts  the  ele¬ 
ments  independently,  and  (2)  that  absolutely  nothing  of  A  is  known  except  for 
its  rank.  Clearly  if  A  is  known  only  to  be  of  minimal-width  and  sufficient 
rank  with  0  unknown,  then  we  know  only  that  its  rank  is  one  less  than  its 
width.  (If  any  Mxn  matrix  A  has  rank (A)  =  N-l,  then  it  is  automatically  0-AR 
for  the  vector  0  that  spans  $ (A) . )  So  if  assumption  (1)  is  satisfied  then 
Theorem  §2  does  indeed  apply.  On  the  other  hand  if  A  is  extra-wide  and  0-AR 
then  we  know  mere  than  just  its  rank,  so  assumption  (2)  is  violated.  If  A  is 
known  to  ha^e  been  formed  as  a  Hankel  matrix  then  both  assumptions  are  violated. 

A 

(But  if  by  some  fortuitous  circumstance  A(K;3VD)  does  indeed  conform  to  all 
of  our  prior  knowledge  of  A,  then  the  conclusion  of  Theorem  §2  should  still 
apply.)  These  facts  might  lead  one  to  avoid  using  either  extra-wide  or  Hankel 
matrices.  However  Hankel  matrices  appear  to  be  very  economical  since  each 
new  element  of  data  permits  the  inclusion  of  another  - ow .  Furthermore  several 
investigators  have  proven  that  if  one  uses  an  extra-wide  Hankel  matrix  and 
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traditional  methods  of  solving  Ax=0,  then  the  non-extraneous  roots  of  the 
resulting  polynomial  often  estimate  the  true  roots  much  more  accurately  than 
if  a  minimal-width  matrix  had  been  used  [10,11,12].  Clearly  there  remain  many 
unanswered  questions  regarding  the  efficacy  of  the  various  alternatives. 

POLE  IDENTIFICATION 

To  apply  the  methods  developed  above  to  the  pole  identification  problem 

posed  in  the  introduction  one  has  merely  to  observe  that  every  post-excitation 

output  sequence  that  can  be  elicited  from  a  system  of  order  K  is  0-AR,  where 

the  generator  0  is  the  K'xl  vector  of  the  pole  polynomial’s  coefficients. 

Thus  every  output  subsequence  of  length  N=K+£  will  be  orthogonal  to  52^(0), 

which  we  shall  now  denote  simply  as  52^,  the  "1th  polespace" ,  to  concede  its 

relationship  to  the  system  poles.  Of  course  52^  is  just  the  space  spanned  by 

0  itself.  Thus  any  M*N  data  matrix  A  whose  rows  are  output  sequences  will 
T 

obey  4>  (A)  =  4>  (A  A)  =  52^,  where  £  =  N-K,  provided  enough  rows  have  been  used  to 
give  the  matrix  sufficient  rank  (i.e.  rank(A)  =  K) .  This  will  hold  true  even 
if  the  various  rows  were  obtained  as  a  result  of  separate  tests,  with  different 
input  excitation,  or  different  sensor  placement.  Any  solution  of  Ax=0  will 
produce  a  polynomial  whose  roots  inc.’ude  the  system  poles  together  with  N-K' 
extraneous  ones.  If  decimation  sequences  of  epoch  q  are  used,  then  the  approp¬ 
riate  subspace  is  52  (ij<) ,  which  we  shall  denote  as  52^.  The  K'xl  vector  4>  is 
formed  of  the  coefficients  of  the  polynomial  whose  roots  are  the  qth  powers 
of  the  system  poles. 

§  17  PRONY'S  METHOD  [14]:  Given  a  single  output  sequence  of  length  2K,  form 
an  MxN  minimal-width  Hankel  matrix  A,  where  M=K  and  N=K' .  Then  solving  Ax=0 
gives  the  generator  0 .  Traditionally  this  is  done  by  converting  to  a  non- 
homogeneous  equation,  which  in  this  case  is  entirely  equivalent  to  forcing 
0*1.  Since  the  equation  is  not  overspecified  it  always  has  an  exact 
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solution,  and  the  only  error  is  due  to  noise  in  A.  Experience  proves  that 
even  a  small  amount  of  noise  can  be  devastating  [14]. 

§  18  LEAST-SQUARES  PRONY  METHOD  [10,11,12]:  Given  an  output  sequence  of 

length  greater  than  2K  one  can  still  form  an  MXN  minimal  width  Hankel  matrix 

A,  where  N=K*  but  now  M>K.  The  equation  Ax=0  is  now  overspecified.  Following 

the  traditional  approach  discussed  earlier  one  seeks  to  determine  the  null- 
T  T 

space  (A)  =  <j>  (A  A)  by  solving  A  Ax=0.  Even  this  equation  is  overspecified 
T  T 

since  A  A  is  K'xK'  but  rank (A  A)  =  K.  If  the  rightmost  column  of  A  can  be 

partitioned  off  without  altering  the  rank,  i.e.  A  =  [A+;a]  where  "a"  is  the 

rightmost  column  of  A  and  the  submatrix  A+  is  nonsingular,  then  the  equation 
T 

A  Ax=0  can  be  expressed  as 


T  i  T 
A+A+  j  A+a 

T  i  T 
a  A+  j  a  a 


x  =  0. 


The  solution,  normalized  so  that  the  last  element  of  x  is  one,  can  be 
expressed  immediately  as 


x  = 


T  -IT 
•  (A^A+)  A^a 


(13) 


Since  A  is  a  minimal  width  data  matrix  the  solution,  x,  given  by  Eq.  (12) 

should  be  the  generator  vector  9,  since  the  solution  of  Ax=0,  or  equiva- 
T 

lently  A  Ax=0,  is  unique  to  within  a  scalar  multiple.  Thus  the  same 

T 

answer  would  result  if  one  used  the  adjoint  solution  x  =  adj (A  A)xQ,  regard¬ 
less  of  the  choice  of  x^.  On  the  other  hand  if  a  corrupted  version  of  A 
is  used,  namely  A  =  A+E  where  the  noise  matrix  E  increases  the  rank  so  that 
rank (A)  is  K' ,  then  one  expects  that  the  adjoint  solution  would  produce  a 
result  different  from  that  of  Eq.  (13).  However  the  following  theorem 
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establishes  a  connection. 

§  19  THEOREM:  Let  A  =  [A+;a]  where  rank (A)  =  rank(A+)  =  K  and  A+  is  non¬ 
singular  and  let  A  =  A+E  =  [A+;a]  where  rank (A)  is  K'.  Then 


provided 


~T~ 

adj (A  A)Xq  = 


~T~  -1-T- 

-(A*A+)  A^a 


xQ  =  [0,0,...  ,0,ct]' 


A  a  {I-A+(AA>  \}a 

° - - 

det (A  A) 


□ 


Many  users  of  this  least-squares  Prony  Method  have  demonstrated  quite 

clearly  that  when  one  "requests"  more  poles  than  actually  exist,  i.e.  when 

one  constructs  A  to  be  extra-wide,  the  resulting  non-extraneous  poles  estimate 

the  system  poles  with  greater  accuracy  [10,11,12].  This  fact  persists  even 

when  data  are  generated  artificially  to  ensure  that  only  a  finite  number  of 

T  -1 

poles  are  really  present.  Theoretically  (A+A+)  should  not  even  exist, 

.  T 

since  A+A+  is  then  an  (N-l)x(N-l)  matrix  of  rank  K<N-1.  Clearly  it  is  possi¬ 
ble  to  carry  out  the  computation  of  Eq.  (13)  only  because  of  noise  inherently 

.  T 

present  in  A  or  introduced  by  imperfect  computation  of  A+A+.  Indeed  in  the 

presence  of  noise  we  may  regard  Eq.  (13)  as  a  special  case  of  the  adjoint 

solution  through  Theorem  §19.  Thus  when  an  extra-wide  data  matrix  is  used 

the  least-squares  Prony  Method  succeeds  only  because  of  noise,  and  Eq.  (13) 

T 

produces  an  (approximate)  solution  to  the  equation  A  Ax=0.  Moreover  the 

T 

fact  that  the  roots  of  the  resulting  polynomial  x  Z  include  the  true  system 
poles  as  a  subset  has  previously  been  observed  only  experimentally;  it  has 
remained  unjustified  by  theory  until  now  (i.e.  in  our  Theorem  §6).  But 
the  fact  that  poles  can  be  estimated  more  accurately  by  using  an  extra-wide 
matrix  remains  to  be  justified  by  theory.  (The  reader  may  argue  this  point 
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with  references  to  "generalized  least  squares",  noise  modelling,  decorrela¬ 
tion  of  residuals  and  the  like,  but  we  shall  contend  in  a  later  section  that 
such  an  explanation  has  serious  logical  gaps,  at  least  as  it  pertains  to  our 
problem. ) 

The  methods  described  in  preceeding  sections  of  this  paper  provide  many 
alternate  avenues  toward  the  solution.  Specific  approaches,  results,  and 
adaptations  are  discussed  below: 

§  20  CONSTRUCTING  DATA  MATRICES  FROM  A  SINGLE  OUTPUT  SEQUENCE  {y (n) :n=0, 1,2, 

. . . } :  The  construction  of  a  Hankel  matrix  as  described  under  the  least  squares 

Prony  method  is  an  obvious  course  of  action.  However  in  that  case  the  noise 

contributions  are  not  statistically  independent  from  element  to  element  in  the 

matrix,  since  the  noises  appear  repeatedly  along  with  the  elements  of  (y(n) }. 

But  it  is  possible  to  construct  an  Mxn  data  matrix  in  which  the  noises  are 

independent  by  using  a  "sliding  grid"  of  decimated  subsequences,  for  example 

A„  =  y^  where  r  =  (i-1)  +  (j-l)q  and  q(>M)  is  the  decimation  epoch.  An 

advantage  of  the  noise  independence  is  that  it  is  possible  to  apply  Theorems 

§1  and  §2.  But  in  applying  Theorem  §1  to  obtain  an  unbiased  estimate  of  0 

it  is  necessary  to  have  two  replicates  of  the  same  sequence  {y(n)},  each  with 

different  (and  totally  independent)  noise  components,  to  construct  A^  and 

Also  note  that  Theorem  §2  is  of  somewhat  limited  value,  since  the  K-rank 
a 

approximant  A(K;SVD)  probably  will  not  strictly  possess  the  '‘.sliding  grid" 
characteristic  (i.e.  its  rows  cannot  be  reassembled  into  a  single  autoregres¬ 
sive  sequence) ,  and  thus  §  is  not  the  true  maximum  likelihood  estimate. 

Whether  or  not  the  construction  of  matrices  with  independent  noise  components 
has  other  benfits  beyond  application  of  Theorems  §1  and  §2,  or  perhaps  even 
has  drawbacks,  has  not  been  demonstrated. 


§  21  USING  ROWS  OF  A  DERIVED  FROM  DIFFERENT  EXCITATIONS  OR  SENSORS:  A 


m 
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single  output  sequence  {y(n)}  may  reveal  some  poles  only  weakly,  and  provide 

only  a  brief  glimpse  of  poles  having  a  rapid  decay  rate  (i.e.  small  |  z  ,J  ) . 

The  advantages  of  obtaining  separate  output  sequences,  using  totally  different 

system  excitations  (to  within  the  limits  of  one's  ability  to  control  the 

excitation  at  all) ,  and  then  using  them  to  construct  more  rows  of  A,  has  been 

largely  ignored  in  the  literature.  Further,  the  use  of  independent  rows  may 

enable  use  of  Theorems  §1  and  §2  to  obtain  "nice”  estimates  of  0. 

i  22  USING  "BETTER"  SOLUTIONS  TO  Ax=0.  Since  investigations  of  the  past 

have  usually  converted  Ax=0  to  a  nonhomogeneous  problem,  the  cost-effectiveness 

of  using  the  other  solutions  we  have  discussed  is  worth  exploring.  Indeed  the 

norm  minimizing  solution  x[norm]  is  not  that  much  more  difficult  to  obtain. 

The  least-squares  Prony  solution  of  Eq.  (13)  is  usually  obtained  by  solving 
T  T 

(A+A+)x+  =  A+a,  where  x+  is  all  of  x  except  for  its  last  element,  rather  than 

actually  inverting  the  (N-l)x(N-l)  matrix.  However  if  an  extra-wide  matrix  A 

.  -T~ 

xs  being  used  (which  is  almost  always  the  case)  then  the  matrices  (A+A+)  and 

-T- 

(A  A)  are  nonsingular  only  because  of  noise,  and  matrix  iteration  with 
~T'  -1 

(A  A)  can  find  x[norm].  But  this  can  be  done  by  successively  solving 
~T- 

A  A  x^+^  =  x^,  with  occasional  renormalization.  Indeed  the  solution  for  x 
found  from  Eq.  (13)  could  be  used  as  the  initial  guess  for  x.  Thus  instead 
of  solving  a  single  (M-l)x(M-l)  nonhomogeneous  matrix  equation  we  solve  an 
M*M  equation  repeatedly;  hopefully  a  few  repetitions  will  suffice.  Of  course 
with  A  extra-wide,  x's  polynomial  will  have  some  extraneous  poles  to  sort  out. 

§  23  EXPUNGING  EXTRANEOUS  POLES:  When  an  extra-wide  A  is  being  used. 

Algorithm  113  or  Theorem  §14  provide  a  means  for  eliminating  the  extraneous 
poles  if  K  linearly  independent  solutions  of  Ax=0  are  first  found.  These 
independent  solutions  can  be  obtained  by  use  of  SVD  or  perhaps  simply  by 


repeating  the  solution  after  some  tinkering  is  performed  on  A  (for  example 
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by  the  introduction  of  some  new  data  containing  new  noise) .  Whether  the  roots 
of  the  resulting  0 -polynomial  will  be  more  accurate  estimates  of  the  true 
poles  than  are  the  non-extraneous  roots  of  the  larger  x-polynomial  remains 
to  be  determined,  but  at  least  this  approach  can  reveal  which  roots  of  the 
x-polynomial  are  extraneous  in  a  fashion  reminiscent  of  the  de-aliasing  pro¬ 
cedures  described  earlier. 

§  24  PREFILTERING  OF  DATA:  It  may  be  desirable  to  prefilter  a  data  sequence 
{y(n)}  by  some  simple  digital  filter,  for  example  to  improve  the  signal-to- 
noise  ratio.  This  may  be  done  to  enhance  the  accuracy  in  determining  a 
particular  pole  or  set  of  poles,  as  by  the  use  of  a  bandpass  filter.  After 
such  filtering  the  data  sequence  has  the  filter  poles  incorporated  into  its 
generator  polynomial.  This  increase  in  the  number  of  poles  must  be  taken  into 
account  when  constructing  the  A  matrix.  These  new  poles  are  known,  and  ought 
to  be  forced  into  the  solution  somehow  (see  below) .  Moreover  if  the  use  of 
a  particular  prefilter  enables  us  to  accurately  determine  seme  subset  of  the 
poles,  then  when  we  use  another  prefilter  to  enhance  estimation  of  other 
poles  we  should  force  our  previously  estimated  poles  into  the  solution. 

§  25  FORCING  KNOWN  POLES:  The  use  of  prefilters  is  not  the  only  impetus  for 
wanting  to  force  known  poles.  The  choice  of  the  sampling  interval  T  or  deci¬ 
mation  epoch  "q"  may  be  tailored  to  accurate  estimation  of  particular  poles, 
and  once  determined  these  estimates  should  be  forced  into  subsequent  analyses 
done  with  different  T  or  q.  Forcing  poles  is  not  difficult.  Assuming  for 
simplicity  that  A  is  a  matrix  of  ordinary  (i.e.  not  decimated)  data,  then 
<j>  (A)  =  £2^.  And  the  power  vector  Z^  of  each  system  pole  z^  is  orthogonal  to 

£2^  by  Theorem  §7.  But  if  a  z,  is  known  then  we  need  only  solve  Ax=0  subject 

T  T  0  1  N 

to  Z^x*0.  If  the  pole  z^  is  real,  then  Z^  =  [z^,z^, _ zi^  can  mere^y  be 

adjoined  to  A  as  a  new  row  with  a  large  scaling  constant  C.  The  larger  C, 


1 
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the  more  strongly  the  pole  z^  is  forced  into  the  solution  of  Afx=0,  where  A 
is  the  new  matrix  formed  by  addition  of  this  new  row.  If  the  pole  is  com¬ 
plex  then  the  same  procedure  could  be  used,  but  this  has  the  unfortunate 
effect  of  making  A  complex.  A  better  approach  is  afforded  by  realizing  that 

complex  poles  occur  only  in  complex  conjugate  pairs,  so  that  instead  of 

T  T  r*0*l  *N  TA  T  *T 

adjoining  Z ^  and  Z^+1  =  [  (zj  ,  (zl  )  , —  ,  (zj  ]  one  adjoins  a^  =  C/2[Z  +  (ZJ  ] 

T  A  —1  T  *  T 

and  a2  =  —  jC  [Z^  -  (Z^)  ]  which  are  both  purely-real  row  vectors,  and  can 

be  expressed  as 


T  _r,  2  .  N  , 

=  Cll,  rcosw,  r  cos2u>,..,  r  cqsNio]  , 

T  2  N 

and  a2  =  C[l,  rsinto,  x  sin2u>, . . ,  r  sinNw] 

with  r  =  |z^|  and  a>  =  arg(z^). 

Thus  for  each  pole  forced,  one  more  row  is  added  to  A.  The  choice  of 
the  scaling  factor  C  is  ad  hoc,  and  some  adjustment  may  be  necessary. 

§  26  USE  OF  DECIMATED  DATA.  Everything  that  has  been  said  concerning  the 
linkage  between  A  and  6  when  A  is  not  composed  of  decimated  data  carries  over 
to  a  linkage  between  A  and  i|>  when  A  xs  decimated,  where  <l>  defines  a  polynomial 
whose  roots  are  the  q^1  powers  of  the  system  poles.  Obvious  modifications 
are  needed  in  a  few  places. 


THE  METHOD  OF  JAIN  AND  ITS  EXTENSIONS 


Transformation  of  a  complex  variable  according  to  some  formula  ?  =  f (z) 
is  often  employed  in  polynomial  root  solving  programs,  and  the  resulting 
roots  cure  then  transformed  back  into  the  z-space.  Previously  we  have  seen 
that  if  one  transforms  the  pole  polynomial  produced  by  0  to  a  new  polynomial 
whose  roots  are  the  powers  of  the  true  poles ,  the  new  f  vector  is  related 
to  decimated  output  sequences  in  the  same  way  that  0  is  related  to  undeci- 
mated  sequences  (y(n)}.  Suppose  instead  that  the  pole  polynomials  were  modified 
by  a  simple  linear  transformation  of  the  variab le  z.  Would  the  new  polynomial 
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coefficient  vector  be  related  to  some  other  modification  of  the  data  sequence 
{y(n)}?  The  answer  is  yes.  Suppose  the  transformation  is  ?  =  [l-c^z]/^  so 
z  =  converting  the  9-polynomial  to  a  y-polynomial: 


<e0  +  e1z  +  ••  9KzK)  =  (Uq  +  +  ••  pkck) 


(14) 


where  5  =  [l-c^z]/c^ ■  Solving  the  ^-polynomial  and  then  transforming  the 
roots  will  give  the  roots  of  the  6-polynomial.  Applying  Eq.  (14)  to  Eq.  (6) 
and  clearing  the  denominators  gives 

{pQ  +  y].[  (l-CjZJA^l  +  ...  +  yK[  (l-c1z)/c2]K}  Y(z)  =  3k_jZK  +  •••  +  BjZ2  +  8Qz, 

where  Y(z)  is  the  Z-transform  of  {y(n)}  and  the  8^'s  are  the  coefficients  in 
the  numerator  polynomial  of  Y(z)/z.  Defining  H(z)  =  c2/ (1-c^z)  and  multiply- 
ing  both  sides  by  H(z)  gives 

P0HK(z)Y(z)  +  y1HK_1(z)Y(z)  +  ...  +  yKY(z)  -  HK(z)  {6k_1zK  +  ...  +  8jZ2  +  eoz} 

The  assumption  that  H(z)  is  the  Z-transform  of  some  filter 
impulse  response  {h(n)}  whose  region  of  consequence  is  compatible  with  that 
of  {y(n)}  permits  us  to  take  the  inverse  transform  of  Eq.  (15)  to  get 

y0nK(n)  +  ylnK-l(n)  +  •••  +  yKn0(n)  =  {h(n)**K:  t 6k_ 1 6 (n+K)  +  ...  +  Bq6 (n+1) } 

K  (16) 

where  {h(n)}*  :  represents  convolution  (filtering)  by  h(n)  for  a 

total  of  K  times,  and  {nQ(n)}  =  {y(n)};  {^(n)}  =  {h(n)  }*{n0(n)  };  .... 

etc.,  i.e.  the  {n^(n)}  sequence  is  the  result  of  filtering  the  data  sequence 

i  times.  But  what  kind  of  filter  is  represented  by  H(z)?  Actually  H(z)  has 

two  inverse  Z-transforms ,  one  causal  and  one  not,  corresponding  to  the 

difference  equations: 

x  ^(n+1)  *  cT^x  ,(n)  -  (c./c.Jx.  (n) 
out  1  out  21m 


Causal : 


(17) 


Non  Causal:  x  .  (n)  =  c  x  (n+1)  +  c  x.  (n)  (18 

out  1  out  2  in 

We  shall  choose  the  non-causal  filter,  for  whose  Z-transform  the  region 

of  consequence  is  |z|  <  |c1|  1,  so  that  to  be  compatible  with  the  region  of 

convergence  for  {y(n)}  one  must  have  I c  |  ^  >  max{ | z  | }  where  the  z.  's  are 

1  k  *  1 

the  system  poles.  The  filter  has  the  property  that  inputs  arriving  at  n<0 

will  contribute  to  the  output  only  for  n<0 .  Since  the  6-sequences  on  the 

right  hand  side  of  Eq.  (16)  are  already  zero  for  n>0,  K-fold  convolution  by 

{h(n)}  will  preserve  this  property,  so  for  n>0  Eq.  (16)  gives  yQn0(n)  + 

Vi, (n)  +  ...  +  u  n„(n)  =  0  for  n>0,  i.e.  the  sequences  n.  (n)  are  linearly 

dependent  through  the  vector  y  =  [y^y^, • . - y^] -  Furthermore  recall  that 

{nQ(n)}  =  {y(n)},  and,  using  the  difference  equation  for  H(z),  ni+1(n)  = 

c ni+1(n+l)  +  c2ni(n)  for  i,  n  >_  0. 

Thus  given  a  {y(n)}  sequence  generated  by  K  poles,  we  can  prefilter  it 

K  times  using  the  first  order  non-causal  filter  described  in  Eq.  (18)  to  get 

the  ni<n)  sequences,  and  then  construct  an  M*K'  data  matrix  A,  where 

T 

^(i-1)  and  then  solve  Ay  =  0  or  (A  A)y  *  0  for  the  K'xl  vector  y. 
Then  the  corresponding  y-polynomiai  can  be  solved  for  its  roots,  which  are 
then  subjected  to  a  simple  linear  transformation  to  get  the  system  poles. 

Jain's  method  [7]  is  a  special  case  of  this  procedure  wherein  c^  =  = 

so  that  the  filtering  is  a  simple  reverse-time  integration  of  a  discrete 
sort,  and  his  A  matrix  is  essentially  infinite  in  height  (i.e.  M»l) .  This 
gives  a  solution 

T 

y  *  ad 3 (A  A)yQ 

A  T 

where  y^  is  an  arbitrary  K'*i  vector.  Note  that  with  P  =  (A  A)  we  have 
M  M-l 

P..  *  £  (A) „ , (A) . ,  =  Z  n.  , (n)  n.  , (n) ,  where  the  summation  is  extended 

ij  £=i  *0  n=0  1-1  3-1 
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to  «  when  M-*-®  to  be  in  conformance  with  Jain's  results,  wherein  our  P  matrix 
corresponds  to  Jain's  "Grammian"  matrix.  Moreover  Jain  uses  the  diagonal 

elements  of  adj[P]  so  that  the  y-vector  is  [  (± )  | j ** »  .  *  ^K'l^K'K'l  ^ 

T  ^ 

=  y  where  are  the  diagonal  cofactors  of  R,  and  the  (±K  notation  has 

been  discussed  earlier.  The  potentially  annoying  (±)..  signs  might  better  be 

avoided  by  simply  using  one  of  the  other  ways  of  constructing  the  solution  of 

Py  «  0.  (See  our  section  entitled  "SOLVING  THE  HOMOGENEOUS  MATRIX  EQUATION.") 

THE  POLES  AS  EIGENVALUES 

§  27  THEOREM:  Given  an  M*K'  data  matrix  A,  whose  rows  are  post-excitation 
output  sequences  from  a  system  having  K  poles,  the  poles  z are  the  eigen¬ 
values  X  in  the  following  generalized  eigenvalue  problem: 

(A*A_)x  *  X(A^A+)x  (19) 

where  A+  is  A  with  its  rightmost  column  removed,  and  A  is  A  with  its  left¬ 
most  column  removed.  □ 

The  usefulness  of  this  result  is  difficult  to  determine.  At  least  it 
provides  a  restatement  of  the  problem  in  which  the  poles  are  eigenvalues , 
and  in  which  the  data  appear  in  fairly  simple  form  with  no  matrix  inverses, 
offering  the  hope  that  amethod  can  be  employed  to  directly  determine  the 
eigenvalues  without  having  to  compute  polynomial  coefficients,  hence  avoiding 
what  is  generally  considered  to  be  a  poorly  posed  problem.  Perhaps  as  better 
numerical  algorithms  become  available  for  the  Ax=XBx  eigenvalue  problem,  one 
of  them  can  successfully  be  applied. 

It  is  interesting  to  note  that  the  eigenvalue  problem  can  also  be 
written  as 

A^(A_  -  X  A+)x  =  0. 


2 


(20) 
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Equation  (20)  has  a  solution  only  if  the  matrix  A+(A_-  X  A+)  is  singular, 
which  must  occur  whenever  X  is  a  system  pole.  The  similarity  of  this  to 
Jain's  "pencil  of  functions"  concept  is  noteworthy,  particularly  in  view  of 
the  following  observation:  The  i*"*1  row  of  A  is  a  data  sequence  {d^  (n)  : 
n  =  0 , . . , K}  so  that  the  ifc^  row  of  (A_-  X  A+)  is  {d^(n+l)  -  X  d^(n):  i  =  0,.., 
(K-l) } . 


OTHER  ESTIMATION  CRITERIA,  ITERATIVE  METHODS,  AND  ASYMPTOTIC  ERROR 


For  a  single  post-excitation  output  sequence  {y(n)}»  Eq.  (6)  shows  that 

the  Z-transform,  Y(z),  can  be  expressed  as  a  function  of  the  parameter  vectors 
T  T 

9  =  [0. ,0. , . . . ,0  ]  and  8  =  [8  ,8, , . . . 8V  .]  that  consist  of  the  coefficients 

U  X  K  0  1  K~  1 

r f  the  pole  polynomial  and  numerator  polynomial.  Equation  (6)  can  be 
expressed  as 


Y(z) 


8r_1  +  SK_2Z_1  +  • • •  +  3qz 
0K  +  0K-1**1  +  •••  +  V 


(21) 


We  can  define  8+  and  0+  as  the  finite  impulse  response  (FIR)  operators 
(Filters)  whose  transfer  functions  are  the  numerator  and  denominator  poly¬ 
nomials  of  Equation  (21) ,  and  9+  as  the  recursive  infinite  impulse  response 
(HR)  operator  that  is  the  reciprocal  of  0+.  To  illustrate,  (a(n)}  =  8+{b(n)} 

means  a(n)  =  0„  b(n)  +  Q  b(n-l)  +  —  +  8_b(n-K+l)  and  {a(n)}  =  0+{b(n)} 

K-l  K— 2  0 

means  a(n)  =  [b(n)  -  0  a(n-l)  -  ...  -  0  a(n-K)]/0  for  -«  <  n  <  +  00 .  with 

K“  1  0  K 

this  notation  0+0+  is  the  unit  operator  and  the  sequence  {y(n)}  may  be 
expressed  by  inverting  the  Z-transforms  in  Equation  (21)  as 

{y (n)  }  =  0+6+  {<5  (n) }  (22) 

where  { 6 (n) }  is  the  impulse  sequence.  If  the  data  sequence  is  corrupted  by 
noise  to  give  {y(n)}  =  {y(n)  +  e(n)}  where  the  e(n)'s  are  zero  mean. 
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independent  Gaussian  random  variables  with  common  variance  a  ,  then  the  log- 
likelihood  function  for  {y(n)}  given  $  and  0  is 

1  2 
L  - - —  £  (y(n)  -  y(n))  +  const. 

2a  n>0 

where  {y(n)}  is  defined  in  terms  of  the  parameter  vectors  0  and  0  by  Equation 

(22).  Thus  the  maximum  likelihood  estimate  of  0,8  is  obtained  by  choosing 

them  to  minimize  J,  =  £  (y(n)  -  y(n))2,  i.e.  to  minimize  the  mean  square 

1  n*0 

error  in  fitting  (y(n)}  to  the  data  sequence  {y(n)}.  If  the  noise  components 
{e(n)}  are  vanishingly  small  then  the  minimum  value  is  J^=0.  In  that  case 
{y(n)  -  y(n)}  =  0  and  this  null  sequence  could  be  operated  upon  by  9+  to 
give  0+{y(n)}  -  0+{y(n)}  =  0t{y(n)}  -  8t{6(n)}  =  0.  Thus  the  same  result 


would 


have  been  achieved  by  minimizing  =  £  r' 

1  n>0 


(n)  where  the  seouence 


{r (n) }  =  0+{y (n) }  -  8+{5(n)}.  But  J2  can  be  decomposed  as  3^  =  3^Q  +  J^, 

A  K-l  2  A  2 

where  J  =  £  r  (n)  and  J  =  £  r  (n) .  Furthermore  the  sequence  8H<S(n)} 

20  n=0  21  n>x 

is  identically  zero  for  n>K,  so  depends  only  upon  0 .  Indeed  can 
always  be  minimized  to  zero  by  setting  6V  .  =  0__y(O)}  &  =  0  y(l)  +  0„  _y(0); 

etc.  ...  ;  and  ultimately  8n  =  0vy(K-l)  +  0  y(K-2  +  ...  0.y(O). 

Thus  the  estimate  of  0  is  obtained  by  choosing  it  to  minimize  J  ,  ^'e* 

to  minimize  the  mean  square  output  of  the  0+  FIR  filter  for  n>K  when  the 
input  is  the  data  sequence  {y(n)}.  But  if  one  constructs  an  M*K'  Hankel 
data  matrix  A  according  to  A_  =  y(i+j-2)  then  3^  =  0TATA0  so  that  the 
optimum  0  is  simply  the  xfnorm]  solution  of  Ax=0  discussed  in  earlier  sec¬ 
tions  of  this  paper. 

However  minimizing  is  not  really  equivalent  to  minimizing  J unless 
both  can  actually  be  minimized  to  zero,  i.e.  the  noise  is  vanishingly  small. 

Otherwise  the  x[norm]  solution  does  not  minimize  J  ,  and  is  therefore  not 
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the  maximum  likelihood  estimate.  Direct  minimization  of  is  difficult, 
being  a  highly  nonlinear  problem.  Steiglitz  [18]  has  very  neatly  described 
the  "iterative-prefiltering"  procedure  for  minimizing  by  choosing  the  FIR 
operators  9+  and  8+  to  minimize 

J  =  Z  (0+§+{y (n) }  -  8+§+{6(n)})2 
3  n>0 

where  §  +  is  the  HR  operator  defined  as  the  reciprocal  operator  to  the  FIR 
operator  0+  resulting  from  the  previous  step  in  the  iteration  process.  (The 
procedure  can  be  started  by  taking  6  as  the  xfnorm]  solution.)  Each  time 
the  minimization  is  done  the  resulting  9  is  used  as  §  in  the  next  step.  If 
the  procedure  converges  then  0=0  and  so  the  resulting  0  is  a  maximum 

likelihood  estimate. 

A  different  iteration  method  can  be  used  to  "improve"  the  estimate  of  0 
beyond  that  of  simply  minimizing  J21  *  6  A  A0  *  j | A0 J |  ,  although  it  is 
based  on  heuristic  arguments.  Even  when  0  is  truly  correct  the  residual 
vector  A9  cannot  be  expected  to  possess  uniform  statistical  variance  in  its 
elements.  Due  to  the  Hankel  matrix  form  of  A  the  residual  vector  can  be 
expressed  as  A0  =  G(M;0)y  where  G  is  the  matrix  defined  in  Equation  (11)  and 
y  is  (M+K) *1  the  vector  whose  elements  form  the  sequence  {y(n) :n=0, . . . , (M+K-l) }. 
But  y  can  be  expressed  as  y  =  y+e  where  y  and  e  are  the  vectors  of  the 
uncorrupted  and  noise  components  respectively,  and  if  9  is  the  true  solu¬ 
tion  then  GySO  since  every  data  subsequence  of  length  K’  is  orthogonal  to  9. 
Hence  A0=G(M;0)e  and  is  therefore  a  zero-mean,  Gaussian  random  vector  with 

co-variance  matrix  A  =  f[A0)  (A0)T]  =  f  [G (M; 0)  eeTGT (M; 0)  ] ,  or  A  =  <j2GGT 
T  2 

since  ff [ee  ]  =  a  I  and  the  G  matrix  is  not  random.  If  A  were  known  a  priori 

^  2 

then  instead  of  minimizing  =  | | A9 | |  one  might  prefer  to  minimize  the 

A  |  |  %  i  i  2  t-T 

weighted  quadratic  form  =  | | A9 | j  _^  =  0  A  A  A0.  Unfortunately  the 
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2  T 

matrix  A  =  a  GG  cannot  be  computed  without  knowing  0 ,  but  an  iterative  pro¬ 
cedure  can  be  employed  in  which  each  new  estimate  of  9  is  used  to  compute  G 
and  estimate  A  for  the  next  step.  This  procedure  is  a  modification  of  that 
used  in  Refs.  [19]  and  [31]. 

In  the  control  theory  literature  the  J criteria  is  the  one  most  often 
used  for  minimization  [15,16].  However  in  that  context  the  problem  is 
usually  complicated  by  the  presence  of  a  persistently  exciting  input  to  the 
system.  Furthermore  there  is  much  emphasis  placed  on  the  problem  of  "bias" 
in  the  estimate  of  9 ,  but  it  is  not  statistical  bias  of  the  type  dealt  with 
in  our  Theorem  §1  and  the  discussion  following.  Rather  it  pertains  to  asymp¬ 
totic  bias  in  the  estimate  of  0  as  the  observation  interval  of  the  system 
output  [y(n)}  becomes  infinite.  Indeed  in  that  context  the  asymptotic  bias 
is  connected  to  statistical  correlation  in  the  residuals  [15],  a  problem  that 
can  be  alleviated  by  the  use  of  pre-whitening  filters,  "generalized-least- 
squares" ,  or  often  simply  by  supposing  that  the  system  is  of  higher  order 
[16].  By  these  approaches  one  can  theoretically  produce  an  estimate  of  9 
that  converges  to  the  actual  0  (not  just  its  maximum  likelihood  estimate) 
as  the  interval  of  observation  becomes  infinite.  In  view  of  the  last 
approach  one  might  conclude  that  the  use  of  extra-wide  data  matrices  in  our 
problem,  since  it  is  equivalent  to  supposing  a  higher  system  order,  is 
therefore  justified  by  the  experience  of  researchers  in  the  control  theory 
area.  However  in  our  problem  there  is  no  persistently  exciting  input,  and 
the  "signal"  portion  of  a  noise-corrupted  data  sequence  {y(n)}  =  (y(n)}  +  (e(n)} 
will  eventually  decay  to  insignificance,  leaving  only  the  noise,  and  we 
cannot  expect  the  estimate  of  9  to  converge  to  the  true  value.  Indeed  it 
will  almost  certainly  begin  to  diverge  as  soon  as  the  signal  component 
decays  to  the  point  of  becoming  lost  in  the  noise. 
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But  there  is  a  line  of  heuristic  reasoning  that  lends  some  relevance  to 

the  asymptotic  behavior  described  above.  Suppose  the  signal  component  of 

{y{n) }  does  strongly  persist  for  a  long  enough  time  that  the  statistical  law 

of  large  numbers  can  be  applied  to  computations  involving  the  noisy  data; 

i.e.  the  observation  interval  is  infinite  with  respect  to  the  noise  but 

finite  with  respect  to  the  signal.  Largeness  of  the  interval  of  observation 

corresponds  to  largeness  of  M  in  the  M*N  Hankel  data  matrix  A.  The  nature 

r 

of  the  "quasi-asymptotic  error"  in  estimating  0  using  A  can  be  determined  by 

studying  the  matrix  ATA  =  (A+E)T(A+E)  =  ATA  +  2(ATE)g  +  ETE,  where  E  is  the 

Hankel  matrix  of  error  components,  and  ()g  denotes  the  symmetric  part  of  a 

matrix.  Every  time  M  is  increased,  new  rows  are  added  to  the  A  and  E  matrices. 

T 

This  means  that  the  elements  of  the  N*N  matrix  A  A  will  continue  to  reflect 
these  additional  summed  components  for  as  long  as  the  signal  persists.  How¬ 
ever  the  matrix  E  is  a  Hankel  matrix  derived  from  a  sequence  of  independent, 

2 

zero  mean  Gaussian  variables  of  variance  a  ,  and  it  immediately  follows  that 
T  2 

E  E  tends  asymptotically  toward  M*I*0  where  I  is  the  identity  matrix.  More- 
T 

over  (A  E)  is  a  linear  combination  of  the  independent,  zero-mean  noise 

variables,  and  can  be  expected  to  converge  to  its  mean  (zero)  by  the  law  of 

T  T 

large  numbers;  i.e.,  it  becomes  insignificant  compared  with  A  A  and  E  E. 

(This  is  a  broad  conclusion  which  would  require  an  unwieldy  set  of  assump¬ 
tions  and  pre-conditions  to  attain  mathematical  rigor) .  The  conclusion  can 

~T~  -  T  2 

be  stated  succinctly;  For  M>>1,  A  A  =  A  A  +  M*o  *1  where  I  is  an  N*N 
identity  matrix.  Moreover  if  the  rightmost  column  of  A  is  isolated  by  par- 

-  ~  -»<p  -  «. 

titioning  it  as  A  =  [A+;a],  then  by  a  similar  line  of  reasoning  A+A+  = 

T  2 

A+A++  M«c  •!  where  I  is  now  an  (N-l)x(N-l)  identity  matrix.  With  these 
approximations  it  is  possible  to  estimate  the  quasi-asymptotic  error  in 
determining  0  given  an  M*N  data  matrix  A  with  M>>1. 
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First  consider  the  least-squares  Prony  solution.  In  this  case  A  is 

Mxk*  and  the  solution  xT  =  [x^>l]  estimates  0T,  where  from  Eq.  (13) ,  x+  = 

_i~t~  ~t~  ~t~ 

-(A  A  )  A  a  which  means  that  x  is  the  solution  of  A  Ax  =  -A  a.  If  one 
+  ++  +  +++  + 

T  2  -T- 

uses  the  approximations  just  derived,  the  result  is  A+A+x+  +  Mo  x+  =  -A+a 

T  *•  T  2  T  T  T 

or  A+A+x+  =  -A+a  +  Y  w^ere  y  =  -Mo  x+  -  A+e  -  E+a  -  E+e .  The  result  is 

clearly  a  perturbation  in  the  solution,  a  fact  that  has  been  explored  in 

more  detail  by  Kay  [17].  The  behavior  of  the  adjoint  solution,  x  * 


~T~  ~T~  -  T  2 

adj (A  A)Xg,  in  light  of  the  approximation  A  A  =  A  A  +  Mo  I  is  somewhat 

similar,  but  it  is  more  instructive  to  view  it  as  one  step  in  the  matrix 

A 

iteration  process  from  xQ  toward  x[norm],  whose  asymptotic  behavior  is  dis¬ 
cussed  below. 

A  A 

Fortunately  the  solution  x[norm],  wherein  x  is  chosen  to  minimize 


| |Ax| |  subject  to  | | x | |  =1,  has  no  quasi-asymptotic  error  since  j |ax| |2  = 
xTATAx  *  xT(ATA  +  Mo2I)x  =  | |Axj  j2  +  M02j | x| | 2  which  obviously  leads  to  the 
same  solution  as  if  there  had  been  no  noise.  Of  course  we  do  not  mean  to 

A 

suggest  that  x[norm]  is  absolutely  errorless,  since  our  argument  is  funda- 


-T~  -  T  2 

mentally  limited  by  the  accuracy  of  the  approximation  A  A  =  A  A  +  Mo  I. 
Nevertheless  it  is  clear  that  x[norm]  is  free  of  the  "asymptotic  bias"  that 
is  given  so  much  attention  in  the  control  theory  literature,  and  this  is 
perhaps  the  strongest  argument  in  its  favor.  Before  continuing  to  the  next 
section  we  remind  the  reader  that  the  above  discussion  pertains  strictly  to 
estimating  0  from  a  single  data  sequence  (y(n)}. 


PREVIOUS  WORK  AND  UNANSWERED  QUESTIONS 
Since  the  problem  treated  in  this  paper  is  in  most  respects  an  extension 
of  that  studied  by  Prony  in  the  eighteenth  century,  not  surprisingly  there 
exists  more  literature  than  can  be  referenced  here.  Nevertheless,  we  shall 
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attempt  to  reference  some  of  the  more  relevant  results,  particularly  those 
with  which  some  of  our  readers  may  not  be  aware. 

An  interesting  treatment  of  the  least-squares  Prony  method,  in  which 
the  polynomial  of  the  system  poles  is  expressed  directly  as  a  determinant, 
is  given  by  Ellington  et  al.  [20] .  Demonstrations  of  the  noise  sensitivity 
problem  have  been  presented  by  Hildebrand  [14]  and  some  analytical  work 
along  those  lines  has  been  done  by  Dudley  [21].  A  very  interesting  study  of 
the  behavior  of  "noise  poles"  as  a  function  of  signal  to  noise  ratio  has 
been  recently  published  by  Kay  [17],  including  a  theoretical  analysis  that 
may  be  considered  applicable  to  our  problem  when  the  observation  interval  is 
infinite  with  respect  to  the  noise  but  finite  with  respect  to  survival  of 
the  signal,  and  an  "autocorrelation"  approach  is  appropriate.  Accuracy  of 
time  domain  and  spectral  domain  reconstructions  of  noisy  signals  using 
Prony-type  estimates  have  been  studied  by  Spitznogle  and  Quasi  [22]  and 
Beatty  and  George  [1],  and  the  latter  paper  provides  a  rare  look  at  the  use 
of  decimated  data  sequences. 

The  use  of  the  SVD  decomposition  has  been  explored  by  Holt  and  Antill 
[23],  but  peculiarly  enough  they  have  applied  it  only  to  the  non-homoge neous 
version  of  the  least-squares  Prony  solution,  in  which  the  problem  becomes 
poorly  conditioned  if  extra  wide  matrices  are  used.  Indeed,  they  use  the 
SVD  to  "recondition"  the  problem,  rather  than  as  a  direct  solution;  thus 
their  adjusted  matrix  must  still  be  inverted.  Earlier  approaches  using 
Householder  triangular ization  have  been  reported  by  others,  in  particular 
Van  Blaricum  and  Mitra  [11]. 

Price  [28]  has  approached  the  problem  from  an  eigenvector  viewpoint, 
although  without  reference  to  SVD.  Moreover,  he  has  addressed  the  problem 
of  forcing  known  poles  by  a  transformation  of  the  space  rather  than  through 
augmentation  of  the  data  matrix  as  we  have  done. 
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An  apparently  different  approach  to  the  problem  of  exponential  repre¬ 
sentations  of  signals  has  been  used  by  Jain  with  proven  success  [7,24],  but 
our  derivation  and  extension  has  demonstrated  its  kinship  with  other  methods 
Applications  of  adaptive  filters  inspired  by  Jain's  method  have  been  done  by 
Auton  [25],  Modification  of  the  Prony  method  to  represent  a  set  of  several 
waveforms  with  a  common  pole  set  has  rarely  been  mentioned  in  the  literature 
although  it  was  utilized  by  Young  and  Huggins  [26]. 

There  remain  many  unanswered  questions  of  which  we  mention  only  a  few: 
Does  the  advantage  of  using  extra-wide  data  matrices  persist  if  one  uses  the 
methods  for  solving  Ax=0  that  are  emphasized  in  our  paper  (i.e.,  homogeneous 

a 

methods,  x[norm],  adjoint  solution)?  Or  when  one  uses  a  non-Hankel  matrix? 
How  cost-effective  is  the  use  of  the  unbiased  version  of  the  adjoint  solu¬ 
tion,  when  applicable?  Following  our  derivation  of  Jain's  method,  what  hap¬ 
pens  when  one  uses  different  transformations  of  the  z  variable?  Would  that 
approach  lead  to  other,  perhaps  superior,  prefilters? 
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Figure  1.  Numbering  pattern 
for  an  8><6  Hankel  matrix. 
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Figure  2.  Successive  states  in  the  algorithm 
for  suppressing  extraneous  poles. 
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APPENDIX 

(THEOREM  PROOFS  AND  PROOF  OUTLINES) 


THEOREM 


§1:  If  P  =  a'Fa  then  P.  .  =  £  (Ai)  .  (A2^v  so  that  p-  •  and  p  are 
1  2  it  -  ki  Jen  ii  nun 


statistically  independent  unless  i  =  m  or  j  =  n.  Each  element  of  adj (P)  is 
a  determinant  of  a  "minor"  submatrix  of  P  with  a  row  and  column  deleted. 

But  the  determinant  of  any  matrix  may  be  defined  [27]  as  the  sum  of  all 
possible  (appropriately  signed)  products  of  elements  of  the  matrix,  wherein 
each  product  has  exactly  one  representative  of  each  column  and  each  row  of 
the  matrix;  i.e.,  no  such  product  contains  more  than  one  element  from  either 
a  single  row  or  a  single  column.  Since  the  latter  property  applies  even  to 
each  minor  submatrix  of  P,  it  follows  that  the  expectation  operator  distrib¬ 
utes  on  each  element  of  adj[P]  first  over  the  sum  (due  to  linearity)  and 

then  onto  the  members  within  each  product  (since  P  .  and  P  have  the  statis- 

13  inn 

tical  independence  property  described  above) ,  and  ultimately  distributes  to 
the  individual  elements  of  A^  and  A2 ,  thus  proving  the  theorem. 


THEOREM  §2:  In  the  first  version  of  the  theorem  Jie  unknown  parameter  A 

determines  the  joint  probability  density  of  the  data  matrix  A,  and  due  to  the 

Gaussian  assumption  the  log-likelihood  function  is  L  -  -(2c  )  | | A— A | |  - 

12 

—  MN  log  (2irc  ).  The  maximum  likelihood  choice  for  A,  given  A,  is  that 
which  maximizes  L  within  the  known  set  of  possible  A's  (in  this  case  the  set 

A 

of  M*N  matrices  of  rank  K) .  Clearly  A(K;SVD)  is  that  choice,  since  it  mini¬ 
mizes  the  Frobenius  norm.  If  K  is  also  an  unknown  parameter,  known  only  to 

A 

be  less  than  N  (where  N<M) ,  then  L  is  clearly  maximized  by  using  A(K;SVD) 
with  K  =  N— 1,  since  A  can  always  be  approximated  as  well  by  matrices  of 
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larger  and  larger  rank.  If  some  probability  distribution  for  K  is  known 

a  priori,  say  P  ,  then  the  traditional  modification  to  the  maximum  likelihood 

procedure  is  to  choose  A  to  maximize  the  weighted  likelihood,  which  gives  the 

log-likelihood  function  L  =  -(2a  )  |  |  A— A I  I  +  log  P  -  —  MN  log  (2Tra  ). 

F  K.  Z 

A 

Since  A(K;SVD)  then  maximizes  L  for  each  particular  K,  the  optimum  K  is  that 

which  minimizes  | |A-A(K;SVD) |  |  -  2a  log  P  .  If  one  seeks  to  estimate  $(A) 

F  K 

rather  than  A,  the  situation  is  more  complicated.  The  nullspace,  <(>(A),  is  a 
peculiar  sort  of  parameter,  and  is  not  sufficient  to  determine  the  proba¬ 
bility  distribution  of  A.  It  does  however  determine  a  family  of  possible 
distributions,  and  thus  a  family  of  log  likelihood  functions  L  = 

-  (2a  )  | |a-a| |  +  const.,  parametrized  by  the  matrix  A  (conpatible  with 

A 

the  specified  nullspace) .  Clearly  setting  A  =  A(K;SVD)  achieves  a  global 

A  A 

maximum  of  L  for  matrices  of  rank  K,  and  taking  <p  (A)  *  tp  (A)  then  gives  a 
nullspace  estimate  whose  family  of  likelihood  functions  includes  one  that 
achieves  the  maximum.  In  another  sense,  estimating  4> (A)  is  a  partial  esti¬ 
mation  problem  in  which  additional  parameters  (i.e.,  the  elements  of  A)  must 
be  estimated  incidentally. 

THEOREM  §3:  To  prove  this  theorem  we  first  develop  several  lemmas. 

4-V>  A  A 

Lemma  Is  For  any  MxN  matrix  B  and  its  K^-rank  approximant  C  =  B(K;SVD) 
where  K  <  rank(B) ,  then  for  any  N*1  vector  x, 

(a)  I  I Bxc | |  >  svk(B)  | | xc | |  ,  and 

(b)  | | Bx- | |  <  svr+1(B)  | | x— | |  . 

T 

Proof:  Let  the  SVD  of  B  be  given  as  B  *  USV  and  let  S  and  V  be  partitioned 

as  S  «  [S^jSq],  V  *  [V^jVq]  where  in  each  case  the  leftmost  K  columns  are 

T  T 

isolated.  Then  B  ■  U(S^V^  +  S^V^) ,  and  the  diagonal  elements  of  are  the 
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first  K  singular  values  of  B,  thus  non- zero.  The  approximant  C  may  be 

obtained  by  artificially  setting  SQ  to  zero,  which  means  that  <j>(C)  is  simply 

the  space  orthogonal  to  the  columns  of  V^,  i.e.,  spanned  by  the  columns  of 
T  T 

VQ.  Thus  V^x-  =  0  =  VQXC*  T*ien  since  u  and  v  are  orthogonal  transformation 
matrices, 

I  lBxcl  I  =  I  10(^1  +  =  HS1V1XC|I  2  SVK(S1)  lKxcM 

or  since  consists  of  orthonormal  columns, 

I  KM  *  svk(b)  UxcM  • 

Similarly 

llBXgll  -  HuiSjV*  .  S0vJ)x5||  -  ||S0vJxj||  <  SVjIS^  ||vjx5|| 
or  singly 

| |Bx-| |  <  svk+1(B)  | |x-| |  . 

Lemma  2:  For  any  M*N  matrix  A  of  rank  K,  and  any  N*1  vector  x, 

llxAll  2  MaxaM/svk(A) 

Proof:  Use  Lemma  1(a)  with  B  =  A  so  that  C  =  A  also;  solve  for  | |xft| |. 

Lemma  3:  With  same  hypothesis  as  Lemma  2,  and  another  N*M  matrix  E,  where 
rank (A)  *  rank(A+EA> „  then  |  |XAI I  <  M  (A+EA> I  |/svK(A+EA) . 

Proof:  Apply  Lenina  2  with  A  replaced  by  A+E  ,  noting  that  x  =  x.  .  . 

A  A  (A+Em ) 

A 

A  A 

Lemma  4:  If  x  is  a  unit  vector  lying  in  the  nullspace  of  A(K;SVD) ,  where 
A  ■  A+E  and  A  is  an  M*N  matrix  of  rank  K,  then 

ll&ll  S  svk+1(A)  -  I  lEl  I 

fa  **  A  A  A  A 

Proof:  Apply  Lemma  1(b)  with  B  *  A  and  x  =  x.  Then  by  assumption  x-  *  x 
and  | |x-||  «  1.  To  get  the  rightmost  inequality  we  use  a  fundamental 
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property  of  singular  values  [9] : 

svr+1(A)  =  svk+1(A+E)  <  svk+1<a)  +  =  0  +  |  |E|  |  . 

Lemna  5:  Same  as  Lenma  4,  ending  with  {  {e—  {  |  instead  of  ||e||. 

Proof:  As  before,  but  use 

■»*+!<*>  -  svKtl(A«s«j)  S  svKtl(A+EA)  ♦  SVjIEj)  -  0  *  ||Ej|| 

Lenma  6:  with  A, E  as  before  and  x  any  unit  vector,  then 

(a)  UaxJI  <  |  |Ax|  |  +  ||E||  ,  and 

(b)  ||(a+ea)xa||  <  | | Ax | |  +  | |e-| 1 

Proof:  Ax,  =  Ax  =  (A-E)x  =  Ax  -  Ex  ,  so 
A 

||axa||  <  ||Ax||  +  ||Ex||  <  | |Ax||  +  1 |e l I  since  | \x\ |  =  1  • 

Part  (b)  results  similarly  upon  noting  that 
(A+EA)xA  -  (A+Ea)x  *  (A-E-)x  . 

Lemma  7:  If  a  and  B  are  positive  and  a  <  B,  then  a+B  <  2B  <  2(a+B). 

Proof:  Obvious. 

The  theorem  follows  immediately,  defining  ^  =  {||ax[|  +  | |e| | }/svR(A) 

and  b  =  2 1 |e| |/sv  (A)  and  using  Lemmas  2,  6(a),  4,  and  7  in  sequence;  and 
2  K 

defining  b3  =  { { (ax| 1  +  l |e- | | }/svK(A+EA)  and  b4  =  2 | |e- j |/svK(A+EA)  before 
applying  Lemmas  3,  6(b),  5  and  7. 

THEOREM  §5:  Since  both  ends  of  the  8  vector  are  non-zero  by  assumption,  no 
row  of  G  can  be  expressed  as  a  linear  combination  of  the  rows  above,  and  all 
the  rows  are  therefore  linearly  independent  by  induction.  To  prove  unique¬ 
ness  of  the  generator,  suppose  that  there  are  two  matrices  and  G^  as  in 
Eq.  (11)  that  span  the  same  S^;  i.e.,  they  have  the  same  rowspace.  Then  by 
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construction  the  top  row  of  G2  is  non-zero  only  for  its  first  K'  elements, 
and  it  can  be  expressed  as  a  linear  combination  of  the  rows  of  .  But  that 
combination  cannot  include  the  last  row  of  G^  since  it  would  produce  a  non¬ 
zero  rightmost  element.  Similarly  each  higher  row  of  G^  can  be  ruled  out  so 
that  the  top  row  of  G2  is  a  scalar  multiple  of  that  of  G^;  i.e.,  the  genera¬ 
tor  is  unique  to  within  a  scalar  multiple. 


THEOREM  §6:  Although  the  dimension  of  the  power  vector  Z  has  always  been 

inferred  from  the  context  of  its  use,  for  the  proof  of  this  theorem  we  shall 

promote  clarity  by  appending  the  dimension  as  a  subscript  in  parentheses, 

T 

e.g.  Z,„.  .  Thus  if  z  is  a  root  of  the  generator  polynomial  then  0  Z,,,.  =  0, 
(N)  (K) 

which  clearly  implies  GZ.„.  =  0,  where  G  is  the  matrix  G(£;0)  of  Eq.  (11). 

(N) 

T 

But  xeO^O)  implies  x  lies  in  the  rowspace  of  G,  i.e.,  x  =  G  a  for  some  £xi 

T  A  T  T  T 

vector  a  =  [a-.a.. , ...]  .  Then  it  is  easily  shown  that  x  Z,„.  =  a  GZ,„.  = 

U  1  IN)  (N) 

T  T 

(a  x  (0  z^Rlj).  Clearly  the  roots  of  the  generator  polynomial  are  a 

T 

subset  of  those  of  x  Z.„. ,  and  the  l-l  extraneous  roots  can  be  placed  at  will 

(N) 

by  selecting  the  elements  of  a  as  the  coefficients  of  a  polynomial  whose 
roots  are  the  desired  extraneous  set.  Moreover  since  the  polynomial  coeffi¬ 
cients  of  all  three  polynomials  are  real,  all  three  root  sets  contain  only 
conjugate  pairs. 


THEOREM  §7:  The  N-dimensional  power  vectors  of  the  (distinct)  roots  z^  of 
the  generator  polynomial  constitute  a  set  of  K  linearly  independent  vectors 
since  they  can  be  juxtaposed  to  form  columns  of  a  matrix  for  which  the  upper 
KxK  submatrix  is  Vandermonde,  and  therefore  nonsingular.  Since  each  such 
power  vector  obviously  lies  within  the  nullspace  of  the  G  matrix  of  Eq.  (11) 
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N 

so  that  the  set  spans  that  nullspace,  the  K-dimensional  subspace  of  R  that 
is  the  orthogonal  complement  is  identically  £2^(0),  the  rowspace  of  G. 

THEOREM  §9:  Obviously  y  J_  if  and  only  if  Gy  =  0,  where  G  is  the  matrix 

of  Eq.  (11),  and  the  theorem  follows  immediately. 

T 

THEOREM  ill:  Clearly  if  A  is  9-AR  then  AG  =0  where  G  is  the  matrix  of 

Eq.  (11).  But  AG^=  0  implies  [29]  rank(A)  <  N  -  rank(G*)  =  N-4  =  K.  However 
T  T 

xen^(0)iff  x  =  G  a  for  some  a,  but  then  Ax  =  AG  a  =  0,  so  xe$(A) .  If 
rank (A)  =  K  then  $(A)  is  a  subspace  of  dimension  N-K  =  4,  the  same  as  the 
dimension  of  £2  (9)  ,  so  in  that  case  i|> (A)  and  £2.(0)  are  identical.  Thus 

A*  X» 

T 

Ax=0  implies  xe£2  (0)  and  therefore  the  roots  of  x  Z  constitute  a  superset  of 

X* 

the  roots  of  the  generator  polynomial  by  Theorem  §6. 

THEOREM  §14:  Since  both  ends  of  0  are  non-zero,  at  least  one  member  of  any 
btsis  of  ( 0 )  must  have  its  first  element  non-zero,  and  the  scaling  described 
in  the  theorem  statement  clearly  does  no  harm.  Furthermore  {g^ja^ ,a3, . . .} 
is  still  a  basis  since  g^  could  be  added  to  the  a's  to  recover  the  original 
basis.  Now  if  x  is  an  arbitrary  vector  in  £2  (9)  and  we  augment  it  by 

attaching  an  initial  zero  element  to  form  a  vector  x,  then  it  can  be  expressed 
as  a  linear  combination  of  the  bottom  l-l  rows  of  G(£;0),  hence  as  a  linear 
combination  of  the  basis  vectors  {g^ia^ja^, . . .) .  Indeed  g^  can  obviously  be 
ruled  out  of  the  combination,  and  it  follows  easily  that  x  lies  in  the  space 
spanned  by  {a2 ,a3, . . . ,a^}  as  defined  in  the  theorem.  Since  that  set  is  of 
dimension  4-1,  it  must  constitute  a  basis  of  £2^  ^(9).  The  corollary  follows 
directly. 
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THEOREM  §16:  Perturbations  in  the  (assumed  distinct)  roots  of  the  polynomial 
T 

x  Z  due  to  perturbations  in  the  x  vector  may  be  determined  to  first  order  by 

setting  the  total  differential  of  the  polynomial  to  zero  at  each  root  loca- 

T  T  T 

tion:  (x+<5x)  Z  +  x  (Z+6Z)  =  0  at  Z  =  Z.  .  But  since  x  Z.  =  0  we  have 

T  T  T  -1 

(6x)  Z.  =  -  x  <5Z .  =  -  x  DZ.z.  dz. 

i  i  ill 

where  the  diagonal  matrix  is  as  defined  in  the  theorem.  Solving  for  dz^/z^ 
and  noting  that  x  =  x+<5x  to  first  order  gives 

dz^/z^  =  -  (Sx)TZ^/(x+<$x)TDZ^  . 

For  the  purpose  of  the  theorem  we  set  the  nominal  value,  x,  as  the  "correct" 

A  A  A 

portion  of  the  approximate  solution  x,  i.e.,  x  =  x-.  Then  clearly  5x  =  xft, 
and  x+6x  =  x  so  the  theorem  follows  immediately. 

THEOREM  §19:  The  theorem  follows  directly  from  the  fact  that  A  A  is  nonsin- 

~T~  ~T~ 

gular,  so  that  premultiplying  the  equation  by  A  A  gives  det(A  A) xQ  on  the 

~T~ 

left-hand  side,  and  A  A  can  be  partitioned  as 


The  rest  is  simple  algebra. 

THEOREM  §27:  The  least-squares  Prony  solution  of  Eq.  (13)  can  be  expressed 

T  T  T  T  — •  l 

as  y  *  [y+;l]  where  y+  is  a  Kxl  vector  defined  as  y+  =  -  (A+A+)  A+a,  and 

T 

the  system  poles  are  the  roots  of  the  polynomial  y  Z.  But  the  roots  of  any 
such  normalized  polynomial  are  identically  the  eigenvalues  of  the  companion 
matrix  [30]: 


,  where  I  is  a  (K-l)x(K-l)  identity 
T 

by  A+A+  and  carefully  carrying  out  the 
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